What happened?
The newly created E2E tests for gang termination are failing. The root cause is that the tests create breaches by cordoning nodes and then killing pods. Currently, when PodClique status is reconciled, we do not report a breach if scheduledReplicas < minAvailable, so the test cases never actually trigger a breach as intended:
    // If the number of scheduled pods is less than the minimum available, then minAvailable is not considered as breached.
    // Consider a case where none of the PodCliques have been scheduled yet, then it should not cause the PodGang to be recreated all the time.
    if scheduledReplicas < minAvailable {
        return metav1.Condition{
            Type:               constants.ConditionTypeMinAvailableBreached,
            Status:             metav1.ConditionFalse,
            Reason:             constants.ConditionReasonInsufficientScheduledPods,
            Message:            fmt.Sprintf("Insufficient scheduled pods. expected at least: %d, found: %d", minAvailable, scheduledReplicas),
            LastTransitionTime: now,
        }
    }
The same check exists in the PCSG status reconciliation:
    if scheduledReplicas < minAvailable {
        return metav1.Condition{
            Type:    constants.ConditionTypeMinAvailableBreached,
            Status:  metav1.ConditionFalse,
            Reason:  constants.ConditionReasonInsufficientScheduledPCSGReplicas,
            Message: fmt.Sprintf("Insufficient scheduled replicas. expected at least: %d, found: %d", minAvailable, scheduledReplicas),
        }
    }
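To make the effect of this guard on the E2E scenario concrete, here is a minimal, self-contained sketch. It is not Grove's code: the function name, the reason strings, and the readiness comparison after the guard are all assumptions for illustration. It also assumes that after cordoning, the replacement for a killed pod stays Pending and therefore does not count as scheduled.

```go
package main

import "fmt"

// Hypothetical stand-ins for the metav1 condition statuses and Grove reason constants.
const (
	statusTrue  = "True"
	statusFalse = "False"

	reasonInsufficientScheduled = "InsufficientScheduledPods"
	reasonInsufficientReady     = "InsufficientReadyPods" // assumed, not from the issue
	reasonSufficientReady       = "SufficientReadyPods"   // assumed, not from the issue
)

// minAvailableBreached models the check quoted above. Only the first guard is
// taken from the issue; the readiness comparison below it is an assumption.
func minAvailableBreached(scheduledReplicas, readyReplicas, minAvailable int32) (status, reason string) {
	if scheduledReplicas < minAvailable {
		return statusFalse, reasonInsufficientScheduled
	}
	if readyReplicas < minAvailable {
		return statusTrue, reasonInsufficientReady
	}
	return statusFalse, reasonSufficientReady
}

func main() {
	const minAvailable = 3

	// Steady state: 3 pods scheduled and ready.
	status, reason := minAvailableBreached(3, 3, minAvailable)
	fmt.Println(status, reason) // False SufficientReadyPods

	// Nodes cordoned, one pod killed: the replacement cannot be scheduled, so
	// only 2 pods count as scheduled and the guard reports "not breached".
	status, reason = minAvailableBreached(2, 2, minAvailable)
	fmt.Println(status, reason) // False InsufficientScheduledPods

	// In this model, a breach is only reported when enough pods are scheduled
	// but too few are ready.
	status, reason = minAvailableBreached(3, 2, minAvailable)
	fmt.Println(status, reason) // True InsufficientReadyPods
}
```

Under these assumptions, the cordon-then-kill flow always lands in the second case, which would explain why the tests never see the condition flip to True.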
I tried a couple of "easy" fixes just to see. They didn't work for various reasons, but maybe someone else has the needed context.
Here's a quick summary of things tried:
One-Way Ratchet
Once a PCLQ is healthy, record that it reached this state in its conditions. This didn't work because PodCliques get re-created in certain situations, and the recorded status is lost along with them.
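A rough sketch of the ratchet idea and of how re-creation defeats it. The `cliqueStatus` type, the `everAvailable` flag, and the `reconcile` function are hypothetical names for illustration, not the code that was actually tried.

```go
package main

import "fmt"

// cliqueStatus is a hypothetical stand-in for the PCLQ status; everAvailable
// plays the role of the ratchet condition.
type cliqueStatus struct {
	everAvailable bool
}

// reconcile reports whether MinAvailable counts as breached, and latches
// everAvailable the first time the clique is healthy.
func reconcile(st *cliqueStatus, scheduled, ready, minAvailable int32) bool {
	if ready >= minAvailable {
		st.everAvailable = true
		return false
	}
	if scheduled < minAvailable && !st.everAvailable {
		return false // never been healthy: treat as still starting up
	}
	return true
}

func main() {
	const minAvailable = 3

	st := &cliqueStatus{}
	fmt.Println(reconcile(st, 0, 0, minAvailable)) // false: initial rollout
	fmt.Println(reconcile(st, 3, 3, minAvailable)) // false: healthy, ratchet latched
	fmt.Println(reconcile(st, 2, 2, minAvailable)) // true: under-scheduled after being healthy

	// If the PodClique object is re-created, its status starts empty and the
	// ratchet is gone, so the same pod counts are masked again.
	recreated := &cliqueStatus{}
	fmt.Println(reconcile(recreated, 2, 2, minAvailable)) // false
}
```

Because the flag lives on the object it describes, it only survives as long as that object does. Keeping it somewhere that outlives the PodClique would be one direction to explore, though whether that fits Grove's design is an open question.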
Grace Period
I tried implementing a grace period during which pods are not marked as in breach, and after which they are. This ran into a number of edge cases, and the tests wouldn't pass.
What did you expect to happen?
No response
Environment
- Kubernetes version
- Grove version
- Scheduler details
- Cloud provider or hardware configuration
- Tools that you are using Grove together with
- Anything else that is relevant