Gang Termination Doesn't Work

### What happened?

The newly created E2E tests for gang termination are failing. The root cause is that the tests try to create breaches by cordoning nodes and then killing pods. Currently, when PodClique status is reconciled, we avoid reporting a breach if `scheduledReplicas < minAvailable`, so the test cases don't actually trigger a breach as intended:

```go
	// If the number of scheduled pods is less than the minimum available, then minAvailable is not considered as breached.
	// Consider a case where none of the PodCliques have been scheduled yet, then it should not cause the PodGang to be recreated all the time.
	if scheduledReplicas < minAvailable {
		return metav1.Condition{
			Type:               constants.ConditionTypeMinAvailableBreached,
			Status:             metav1.ConditionFalse,
			Reason:             constants.ConditionReasonInsufficientScheduledPods,
			Message:            fmt.Sprintf("Insufficient scheduled pods. expected at least: %d, found: %d", minAvailable, scheduledReplicas),
			LastTransitionTime: now,
		}
	}
```

The same check exists in the PCSG status reconciliation:

```go
	if scheduledReplicas < minAvailable {
		return metav1.Condition{
			Type:    constants.ConditionTypeMinAvailableBreached,
			Status:  metav1.ConditionFalse,
			Reason:  constants.ConditionReasonInsufficientScheduledPCSGReplicas,
			Message: fmt.Sprintf("Insufficient scheduled replicas. expected at least: %d, found: %d", minAvailable, scheduledReplicas),
		}
	}
```
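
To make the failure mode concrete, here is a minimal, self-contained sketch of the guard's effect. It is not the Grove implementation: `computeMinAvailableBreached`, the `breachStatus` type, the reason strings, and the final `readyReplicas` comparison are all assumptions made for illustration. It also assumes that, once a node is cordoned and a pod is killed, the replacement pod stays Pending and so no longer counts as scheduled.

```go
package main

import "fmt"

// breachStatus is a simplified, hypothetical stand-in for the status field
// of the metav1.Condition returned by the reconcilers quoted above.
type breachStatus string

const (
	breachFalse breachStatus = "False"
	breachTrue  breachStatus = "True"
)

// computeMinAvailableBreached mirrors the quoted guard: when fewer than
// minAvailable pods are scheduled, no breach is reported. The readyReplicas
// comparison and the reason strings are assumptions, not Grove's actual code.
func computeMinAvailableBreached(scheduledReplicas, readyReplicas, minAvailable int32) (breachStatus, string) {
	if scheduledReplicas < minAvailable {
		return breachFalse, "InsufficientScheduledPods"
	}
	if readyReplicas < minAvailable {
		return breachTrue, "InsufficientReadyPods"
	}
	return breachFalse, "SufficientReadyPods"
}

func main() {
	const minAvailable int32 = 3
	cases := []struct {
		name             string
		scheduled, ready int32
	}{
		{"all pods running", 3, 3},
		{"node cordoned, pod killed, replacement Pending", 2, 2},
		{"pod scheduled but not ready", 3, 2},
	}
	for _, c := range cases {
		status, reason := computeMinAvailableBreached(c.scheduled, c.ready, minAvailable)
		fmt.Printf("%-48s breached=%s reason=%s\n", c.name, status, reason)
	}
}
```

Under these assumptions, the cordon-and-kill case takes the early return and reports no breach, which matches what the E2E tests observe. Only a pod that is scheduled but not ready would get past the guard.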

I tried a couple of "easy" fixes just to see whether they would help. They didn't work, for various reasons, but maybe someone else has the needed context.

Here's a quick summary of things tried:

## One-Way Ratchet
Once a PCLQ is healthy, track that it achieved this in its conditions. This didn't work because PodCliques were getting re-created in certain situations, and the recorded status was lost along with them.
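
A rough sketch of the ratchet idea, and of how re-creation defeats it, might look like the following. The `cliqueStatus` type, its `WasAvailable` field, and the `reconcile` function are hypothetical names chosen for this example. The actual attempt stored the information in the PodClique's conditions, and its details are not shown in this issue.

```go
package main

import "fmt"

// cliqueStatus is a hypothetical, simplified stand-in for a PodClique status.
type cliqueStatus struct {
	// WasAvailable is the ratchet: it is set once the clique has been healthy
	// and is never cleared by reconciliation.
	WasAvailable bool
}

// reconcile updates the ratchet and reports whether minAvailable counts as
// breached. The "insufficient scheduled pods" exemption only applies to
// cliques that have never been healthy.
func reconcile(st *cliqueStatus, scheduled, ready, minAvailable int32) bool {
	if ready >= minAvailable {
		st.WasAvailable = true
		return false
	}
	if scheduled < minAvailable && !st.WasAvailable {
		// Never healthy so far: treat this as initial scheduling, not a breach.
		return false
	}
	return true
}

func main() {
	const minAvailable int32 = 3

	st := &cliqueStatus{}
	fmt.Println("healthy, breached:", reconcile(st, 3, 3, minAvailable))
	fmt.Println("after cordon and kill, breached:", reconcile(st, 2, 2, minAvailable))

	// If the PodClique object is re-created, its status starts empty and the
	// ratchet is gone, so the same pod counts no longer report a breach.
	recreated := &cliqueStatus{}
	fmt.Println("after re-creation, breached:", reconcile(recreated, 2, 2, minAvailable))
}
```

The last call illustrates the problem described above: any state kept on the PodClique itself does not survive re-creation of that object.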

## Grace Period
I tried implementing a grace period, during which pods are not marked as in breach, but afterwards they are. There ended up being a number of edge cases and the tests wouldn't pass. 

### What did you expect to happen?

_No response_

### Environment

- Kubernetes version
- Grove version
- Scheduler details
- Cloud provider or hardware configuration
- Tools that you are using Grove together with
- Anything else that is relevant

