Gang Termination Doesn't Work #277

@gflarity

What happened?

The newly created E2E tests for gang termination are failing. The root cause is that the tests cordon nodes and then kill pods to create breaches. Currently, when PodClique status is reconciled, we avoid reporting a breach if scheduledReplicas < minAvailable, so the test cases don't actually trigger a breach as intended:

	// If the number of scheduled pods is less than the minimum available, then minAvailable is not considered as breached.
	// Consider a case where none of the PodCliques have been scheduled yet, then it should not cause the PodGang to be recreated all the time.
	if scheduledReplicas < minAvailable {
		return metav1.Condition{
			Type:               constants.ConditionTypeMinAvailableBreached,
			Status:             metav1.ConditionFalse,
			Reason:             constants.ConditionReasonInsufficientScheduledPods,
			Message:            fmt.Sprintf("Insufficient scheduled pods. expected at least: %d, found: %d", minAvailable, scheduledReplicas),
			LastTransitionTime: now,
		}
	}

The same check also exists in the PCSG status reconciliation:

	if scheduledReplicas < minAvailable {
		return metav1.Condition{
			Type:    constants.ConditionTypeMinAvailableBreached,
			Status:  metav1.ConditionFalse,
			Reason:  constants.ConditionReasonInsufficientScheduledPCSGReplicas,
			Message: fmt.Sprintf("Insufficient scheduled replicas. expected at least: %d, found: %d", minAvailable, scheduledReplicas),
		}
	}
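
To make the failure mode concrete, here is a minimal, self-contained sketch of how the guard interacts with the test scenario. It is not the actual Grove code: `minAvailableBreached`, the reason strings, and the `readyReplicas` check are assumptions made for illustration. Only the first `if` mirrors the snippets above.

```go
package main

import "fmt"

// Hypothetical reason strings; the real values live in Grove's constants package.
const (
	reasonInsufficientScheduled = "InsufficientScheduledPods"
	reasonInsufficientReady     = "InsufficientReadyPods"
	reasonSufficientReady       = "SufficientReadyPods"
)

// minAvailableBreached reports whether the breach condition would be true,
// together with a reason. The first check mirrors the guard quoted above; the
// readyReplicas check is an assumption about the rest of the real function.
func minAvailableBreached(scheduledReplicas, readyReplicas, minAvailable int32) (bool, string) {
	if scheduledReplicas < minAvailable {
		// The guard: too few scheduled pods is never treated as a breach.
		return false, reasonInsufficientScheduled
	}
	if readyReplicas < minAvailable {
		return true, reasonInsufficientReady
	}
	return false, reasonSufficientReady
}

func main() {
	const minAvailable int32 = 3

	scenarios := []struct {
		name             string
		scheduled, ready int32
	}{
		{"all pods scheduled and ready", 3, 3},
		{"pod crashes but stays scheduled", 3, 2},
		// What the E2E tests do: the replacement pod cannot be scheduled
		// on a cordoned node, so the scheduled count drops as well.
		{"node cordoned, pod killed, replacement stays pending", 2, 2},
	}
	for _, s := range scenarios {
		breached, reason := minAvailableBreached(s.scheduled, s.ready, minAvailable)
		fmt.Printf("%-55s breached=%t reason=%s\n", s.name, breached, reason)
	}
}
```

Under these assumptions, the third scenario is indistinguishable from a gang that has never been scheduled, which is why the guard suppresses the breach the tests are trying to provoke.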

I tried a couple of "easy" fixes just to see. They didn't work for various reasons, but maybe someone else has the needed context.

Here's a quick summary of things tried:

One-Way Ratchet

Once a PCLQ is healthy, track that it achieved this in its conditions. This didn't work because PodCliques were getting re-created in certain situations, and when that happens this status is lost as well.
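
A rough sketch of the ratchet idea and of why re-creation defeats it. This is not the code that was tried: `cliqueStatus`, `everAvailable`, and `breached` are hypothetical names, and the flag is kept in a plain struct field here rather than in real status conditions.

```go
package main

import "fmt"

// cliqueStatus is a hypothetical stand-in for PodClique status.
// everAvailable is the one-way ratchet: once set, it is never cleared.
type cliqueStatus struct {
	everAvailable bool
}

// breached applies the ratchet: the "not enough scheduled pods" guard is
// only honoured while the clique has never been available.
func (s *cliqueStatus) breached(scheduled, ready, minAvailable int32) bool {
	if ready >= minAvailable {
		s.everAvailable = true
		return false
	}
	if scheduled < minAvailable && !s.everAvailable {
		// Still starting up: do not report a breach.
		return false
	}
	return true
}

func main() {
	const minAvailable int32 = 3

	original := &cliqueStatus{}
	fmt.Println("healthy:", original.breached(3, 3, minAvailable))
	fmt.Println("after pod loss:", original.breached(2, 2, minAvailable))

	// Re-creating the object starts from an empty status, so the ratchet
	// is gone and the same pod loss is suppressed again.
	recreated := &cliqueStatus{}
	fmt.Println("after re-creation:", recreated.breached(2, 2, minAvailable))
}
```

The last call shows the problem described above: any state stored on the PodClique itself disappears with it, so the ratchet would presumably need to live on an object that survives re-creation.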

Grace Period

I tried implementing a grace period, during which pods are not marked as in breach, but afterwards they are. There ended up being a number of edge cases and the tests wouldn't pass.

What did you expect to happen?

No response

Environment

  • Kubernetes version
  • Grove version
  • Scheduler details
  • Cloud provider or hardware configuration
  • Tools that you are using Grove together with
  • Anything else that is relevant

Labels

kind/bug: Categorizes issue or PR as related to a bug.
