DefaultPodTopologySpread graduation to Beta #2011
k8s-ci-robot merged 1 commit into kubernetes:master
Conversation
/assign @Huang-Wei @ahg-g
Will request PRR after sig approval
/hold
Force-pushed a19a80a to a41ef5c
> The performance should be as close as possible.
> [Beta] There should not be any significant degradation in the kubemark benchmark for vanilla workloads.
> - **E2E/Conformance Tests**: Test "Multi-AZ Clusters should spread the pods of a {replication controller, service} across zones" should pass.
>   This test is currently broken in 5k nodes.
Is there an issue/link we can share here?
I don't think this test runs in OSS, but GKE runs it internally, there is no public link.
I think we don't run multi-AZ tests in OSS. I did a check in https://k8s-testgrid.appspot.com/ and I couldn't find it.
In OSS we run tests in single-zone clusters only. It's on the roadmap to run nodes across multiple zones (hopefully this should happen soon).
> * **Does enabling the feature change any default behavior?**
>
> Yes, users might experience more spreading of Pods among Nodes and Zones in certain topology distributions.
clarify that more spreading will be more noticeable in large clusters (over 100 nodes).
> * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
>
> For 100 nodes:
mention minimum master size (I think 4 cores)
> - Mitigations: Disable the Feature Gate DefaultPodTopologySpread in kube-scheduler.
> - Diagnostics: N/A.
> - Testing: There are performance dashboards.
> - Node utilization is low, when using cluster-autoscaler.
can you explain how this is related? seems too indirect to mention here.
I would replace this with: "pods of a service/replicaset/statefulset are not properly spread across nodes/zones". This can be detected by observing the spread of the pods of a service across nodes.
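As a sketch of that detection signal: the skew of a workload's pods across topology domains can be computed from pod-to-node assignments. The `max_skew` helper, the pod list, and the node-to-zone map below are hypothetical illustrations, not part of the KEP:

```python
from collections import Counter

def max_skew(pod_nodes, node_topology):
    """Max skew of a workload across topology domains.

    pod_nodes: list of node names, one entry per pod of the workload.
    node_topology: maps node name -> topology domain (e.g. zone).
    Skew = pods in the most loaded domain minus pods in the least
    loaded domain; domains that have nodes but no pods count as 0.
    """
    counts = Counter(node_topology[n] for n in pod_nodes)
    domains = set(node_topology.values())
    per_domain = [counts.get(d, 0) for d in domains]
    return max(per_domain) - min(per_domain)

# Hypothetical cluster: two nodes in zone-a, one in zone-b.
topology = {"n1": "zone-a", "n2": "zone-a", "n3": "zone-b"}
print(max_skew(["n1", "n1", "n2", "n3"], topology))  # → 2 (3 pods vs 1)
```

A sustained high skew for a service's pods would be the symptom described above.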
/lgtm

/assign @wojtek-t
> * **Does enabling the feature change any default behavior?**
>
> Yes. Users might experience more spreading of Pods among Nodes and Zones in certain topology distributions.
What are the default spreading params? How do we group pods for spreading by default?
Also - is the new default spreading preferred or forced (i.e. predicate or priority)?
It's already documented in the KEP (#default-constraints), and it continues to be scoring (priority).
Got it - can you please cross-link here?
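For readers following along, the default constraints referenced above look roughly like the following `KubeSchedulerConfiguration` fragment. The `maxSkew` values and API version shown here reflect the KEP's #default-constraints section around the time of this review and may have changed since:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
  - pluginConfig:
      - name: PodTopologySpread
        args:
          # whenUnsatisfiable: ScheduleAnyway makes these scoring-only
          # (priority), so spreading is preferred, never forced.
          defaultConstraints:
            - maxSkew: 3
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: ScheduleAnyway
            - maxSkew: 5
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
```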
> For 100 nodes, with a 4-core master:
>
> - Latency for PreScore less than 15ms at the 99th percentile.
> - Latency for Score less than 50ms at the 99th percentile.
Sounds like quite a lot, isn't it?
I agree. We don't need to break them down, I guess, but the total overhead of scoring (pre and actual scoring) should be 15ms at the 99th percentile and 2ms at the 50th percentile.
I used the numbers from our current performance dashboards. I guess we can include 95th percentile as well. WDYT?
What are the 95th percentiles?
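One way to read these percentiles off a live cluster is a PromQL query over the scheduler framework's per-extension-point latency histogram. This is a sketch: it assumes the `scheduler_framework_extension_point_duration_seconds` metric exported by kube-scheduler in this release is being scraped.

```promql
histogram_quantile(0.95,
  sum by (le) (
    rate(scheduler_framework_extension_point_duration_seconds_bucket{extension_point="Score"}[5m])))
```

Swapping `0.95` for `0.99` or `0.5` gives the other percentiles discussed above.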
> Scheduling time on clusters with more than 100 nodes. Smaller clusters are unaffected.
> `SelectorSpreading` doesn't take into account all the Nodes in big clusters when calculating skew,
> resulting in partial spreading at this scale.
> On the contrary, `PodTopologySpreading` considers all nodes when using topologies bigger than a Node, like a Zone.
In synthetic unit-level benchmarks, the difference is negligible. But we will have more precise numbers from the dashboards when we enable the feature by default (beta).
Re benchmarks: even in 5k clusters? Are those using the same config we will have in real clusters?
The benchmark is actually a bit outdated, but its CPU footprint is the same. And the benchmark is for 1000 nodes.
https://github.com/kubernetes/kubernetes/blob/6e3ef0be163566c08398e1e5ff43f87accfb036b/pkg/scheduler/framework/plugins/podtopologyspread/scoring_test.go#L754
This is the legacy counterpart https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/selectorspread/selector_spread_perf_test.go
OK - that's fine.
Can you please add a note to ensure to validate it during graduation?
Something like:
"Before graduation we will ensure that the latency increase is acceptable with Scalability SIG"
(or sth along those lines).
> * **Will enabling / using this feature result in non-negligible increase of
>   resource usage (CPU, RAM, disk, IO, ...) in any components?**
>
> kube-scheduler needs to use more CPU to calculate Zone spreading.
Not sure what to use. Is wall time from synthetic benchmarks enough?
If we don't have anything else - that's at least something.
> * **Are there any tests for feature enablement/disablement?**
>
> There are unit tests in `pkg/scheduler/algorithmprovider/registry_test.go` that exercise the configuration of `kube-scheduler` with the plugins that correspond to the Feature Gate enablement.
They are useful, but these aren't purely enablement/disablement - they are checking the configuration.
The feature enablement/disablement leads to a different configuration.
Let me know if the new wording is more clear, or you are asking for other tests.
> * **What are other known failure modes?**
>
> - Pod scheduling is slow
>   - Detection: Pod creation latency is too high.
s/creation/startup time/ ?
Or scheduling?
startup time. That is the first symptom, and then you narrow down from there :)
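As a sketch of how that narrowing-down might start: if pod startup time is high, checking the scheduler's end-to-end scheduling latency isolates its share of the delay. This assumes the `scheduler_e2e_scheduling_duration_seconds` histogram present in kube-scheduler at the time of this review (it has since been superseded by newer scheduling-latency metrics):

```promql
histogram_quantile(0.99,
  sum by (le) (rate(scheduler_e2e_scheduling_duration_seconds_bucket[5m])))
```

If this stays low while startup latency is high, the bottleneck is likely outside the scheduler (e.g. image pulls or kubelet).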
ping @wojtek-t
wojtek-t left a comment
One last minor comment - other than that LGTM.
Force-pushed 03ae86e to 2d59bef
squashed

PRR is done #2011 (review)

/lgtm

[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ahg-g, alculquicondor

/hold cancel
/sig scheduling
Refs #1258
Includes: