[KEP-3521] Part 3: Bug fixes, integration & E2E Test #113442
k8s-ci-robot merged 3 commits into kubernetes:master from
Conversation
This PR may require stable metrics review. Stable metrics are guaranteed to not change. Please review the documentation for the requirements and lifecycle of stable metrics and ensure that your metrics meet these guidelines.

This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.

/assign @ahg-g

/remove-sig api-machinery
E2E is now only tested locally.
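For anyone wanting to reproduce that, here is a sketch of a local run (hack/local-up-cluster.sh and hack/ginkgo-e2e.sh are the standard local drivers; the focus regex is an assumption about how the new spec is named):

```sh
# Terminal 1: bring up a local cluster.
./hack/local-up-cluster.sh

# Terminal 2: build the E2E binaries, then run the suite focused on the
# new spec (the focus string is a guess).
make WHAT=test/e2e/e2e.test
./hack/ginkgo-e2e.sh --ginkgo.focus="SchedulingGates"
```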
This is good to have for sure, but I assume that podInfo for a gated pod should never have its Timestamp set beyond the first time we observed the pod (on Add(...)) because we should have never attempted to schedule the pod (i.e., Attempts should always be zero), and so Timestamp shouldn't be reset.
Yes, attempts is always zero, so duration := p.calculateBackoffDuration(podInfo) would always return podInitialBackoffDuration. In other words, if the duration between when the pod was added and when it was updated is less than podInitialBackoffDuration, isPodBackingOff would return true.
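To make the reasoning concrete, here is a minimal sketch of the calculation being discussed (an illustrative reconstruction of the scheduling queue's logic with simplified names and signatures, not the exact upstream code): with Attempts == 0 the doubling loop never runs, the initial backoff is returned, and a gated pod whose Timestamp was recently refreshed wrongly looks like it is backing off.

```go
package main

import (
	"fmt"
	"time"
)

// calculateBackoffDuration mirrors the scheduling queue's doubling backoff
// (an illustrative reconstruction; names and signature are simplified).
func calculateBackoffDuration(initial, max time.Duration, attempts int) time.Duration {
	duration := initial
	for i := 1; i < attempts; i++ {
		// Cap before doubling to avoid overshooting the max.
		if duration > max-duration {
			return max
		}
		duration += duration
	}
	return duration
}

// A pod is considered "backing off" while its queue Timestamp plus the
// backoff duration is still in the future (isPodBackingOff-style check).
func isBackingOff(timestamp time.Time, backoff time.Duration) bool {
	return timestamp.Add(backoff).After(time.Now())
}

func main() {
	// A gated pod has Attempts == 0, so the loop above never runs and the
	// initial backoff is returned unchanged; with the 120s/200s settings
	// from the repro diff below, a pod updated 30s after being added still
	// appears to be backing off even though it was never attempted.
	added := time.Now().Add(-30 * time.Second)
	backoff := calculateBackoffDuration(120*time.Second, 200*time.Second, 0)
	fmt.Println(isBackingOff(added, backoff)) // true
}
```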
This bug can be observed by running hack/local-up-cluster.sh with the following diff:
```diff
diff --git a/hack/local-up-cluster.sh b/hack/local-up-cluster.sh
index 6ec7f1ded28..e2c67dd0e34 100755
--- a/hack/local-up-cluster.sh
+++ b/hack/local-up-cluster.sh
@@ -869,6 +869,8 @@ clientConnection:
   kubeconfig: ${CERT_DIR}/scheduler.kubeconfig
 leaderElection:
   leaderElect: false
+podInitialBackoffSeconds: 120
+podMaxBackoffSeconds: 200
 EOF
 ${CONTROLPLANE_SUDO} "${GO_OUT}/kube-scheduler" \
   --v="${LOG_LEVEL}" \
```
looks good to me, pls squash
- test generic integration in plugins_test.go
- test integration with SchedulingGates plugin in queue_test.go
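For illustration, one possible way to run these integration tests locally (a sketch: the make target is the standard integration-test entry point, but the -run pattern is an assumption about the new tests' names):

```sh
# Integration tests need a local etcd; the make target brings one up.
# The -run regex below is a guess at the new tests' names.
make test-integration WHAT=./test/integration/scheduler \
  KUBE_TEST_ARGS="-run SchedulingGates"
```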
Oh I need one test owner to approve the

/lgtm for the scheduler

/retest
```go
},
{
	name: "pod is not admitted to enqueue",
	pod:  st.MakePod().Name("p").Namespace(testCtx.NS.Name).SchedulingGates([]string{"foo"}).Container("pause").Obj(),
```
is it a problem that both test cases use a pod with the same name?
it's fine: in an integration test, each sub-test's context/env is destroyed, and each sub-test is supposed to run statelessly.

```go
testutils.CleanupPods(testCtx.ClientSet, t, []*v1.Pod{pod})
```

```go
for _, idx := range tt.rmPodsSchedulingGates {
	patch := `{"spec": {"schedulingGates": null}}`
	podName := tt.pods[idx].Name
	if _, err := cs.CoreV1().Pods(ns).Patch(ctx, podName, types.StrategicMergePatchType, []byte(patch), metav1.PatchOptions{}); err != nil {
```
not an expert on these things, but I can see that the strategy is `merge`:

staging/src/k8s.io/api/core/v1/types.go, lines 3340 to 3343 @ 7b6293b
is this patch removing the gates?
yes, it's removing all scheduling gates.
yeah, for strategic merge patch:

- `foo: null` deletes the whole field
- `foo: [{"name":"bar"}]` adds or overwrites the existing entry with name=bar
- `foo: [{"$patch":"delete", "name":"bar"}]` deletes the existing entry with name=bar
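As a quick illustration of the first form (a sketch; the pod name `p` is hypothetical, and since the API only permits removing scheduling gates after creation, the delete forms are the only ones that validate for this particular field):

```sh
# Strategic merge patch is kubectl's default for built-in resources;
# setting the field to null removes all scheduling gates, mirroring the
# client-go Patch call in the test above.
kubectl patch pod p -p '{"spec":{"schedulingGates":null}}'
```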
the function says "evaluate if a certain amount of pods in given ns are running.", but it doesn't yet check that there are num pods running.
Maybe you can exit fast if len(pods) < num after the Pods().List; this guarantees that nonMatchingPods == 0 => runningPods == num.
oh, thanks for the catch. I should have rewritten it in the style of WaitForPodsSchedulingGated() but somehow missed it. Let me update.
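For reference, a minimal sketch of the suggested fix (the helper name, signature, and surrounding wait machinery are assumptions, not the merged code): list the pods, bail out early when fewer than num exist, then count non-running pods so that nonMatchingPods == 0 implies num pods are running.

```go
package testutils

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	clientset "k8s.io/client-go/kubernetes"
)

// podsRunning is a hypothetical reconstruction of the helper under review:
// it evaluates whether num pods in the given namespace are running.
func podsRunning(ctx context.Context, cs clientset.Interface, ns string, num int) wait.ConditionFunc {
	return func() (bool, error) {
		pods, err := cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{})
		if err != nil {
			return false, err
		}
		// Exit fast when not enough pods exist yet, per the suggestion above.
		if len(pods.Items) < num {
			return false, nil
		}
		nonMatchingPods := 0
		for _, pod := range pods.Items {
			if pod.Status.Phase != v1.PodRunning {
				nonMatchingPods++
			}
		}
		return nonMatchingPods == 0, nil
	}
}
```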
/triage accepted

/retest

+1 on the mechanics, I don't have enough knowledge to review the feature and its logic

/approve for tests
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, aojea, Huang-Wei, logicalhan

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Carrying over the lgtm since the diff was only about the comments on the tests and it is close to code freeze.

/lgtm
What type of PR is this?
/kind feature
/sig scheduling
What this PR does / why we need it:
This PR is rebased atop Part 1 (#113274) & Part 2 (#113275), covering the remaining logic: bug fixes, integration tests, and an E2E test.
Meanwhile, I'm creating a draft PR in test-infra kubernetes/test-infra#27862 to enable it in CI.
Which issue(s) this PR fixes:
Part of #113269
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: