DefaultPodTopologySpread graduation to Beta #2011
k8s-ci-robot merged 1 commit into kubernetes:master
Conversation
/assign @Huang-Wei @ahg-g
Will request PRR after sig approval
/hold
Force-pushed a19a80a to a41ef5c
> The performance should be as close as possible.
> [Beta] There should not be any significant degradation in the kubemark benchmark for vanilla workloads.
> - **E2E/Conformance Tests**: Test "Multi-AZ Clusters should spread the pods of a {replication controller, service} across zones" should pass.
>   This test is currently broken in 5k nodes.
Is there an issue/link we can share here?
I don't think this test runs in OSS, but GKE runs it internally, there is no public link.
I think we don't run multi-AZ tests in OSS. I did a check in https://k8s-testgrid.appspot.com/ and I couldn't find it.
In OSS we run tests in single-zone clusters only. It's on the roadmap to run nodes across multiple zones (hopefully this should happen soon).
> * **Does enabling the feature change any default behavior?**
>
> Yes, users might experience more spreading of Pods among Nodes and Zones in certain topology distributions.
clarify that more spreading will be more noticeable in large clusters (over 100 nodes).
> * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
>
> For 100 nodes:
mention minimum master size (I think 4 cores)
> - Mitigations: Disable the Feature Gate DefaultPodTopologySpread in kube-scheduler.
> - Diagnostics: N/A.
> - Testing: There are performance dashboards.
> - Node utilization is low, when using cluster-autoscaler.
can you explain how this is related? seems too indirect to mention here.
I would replace this with: "pods of a service/replicaset/statefulset are not properly spread across nodes/zones". This can be detected by observing the spread of the pods of a service across nodes.
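As a sketch of that detection signal: the skew of a workload's pods across topology domains can be computed from pod-to-node assignments. The `max_skew` helper, the pod list, and the node-to-zone map below are hypothetical illustrations, not part of the KEP:

```python
from collections import Counter

def max_skew(pod_nodes, node_topology):
    """Max skew of a workload across topology domains.

    pod_nodes: list of node names, one entry per pod of the workload.
    node_topology: maps node name -> topology domain (e.g. zone).
    Skew = pods in the most loaded domain minus pods in the least
    loaded domain; domains that have nodes but no pods count as 0.
    """
    counts = Counter(node_topology[n] for n in pod_nodes)
    domains = set(node_topology.values())
    per_domain = [counts.get(d, 0) for d in domains]
    return max(per_domain) - min(per_domain)

# Hypothetical cluster: two nodes in zone-a, one in zone-b.
topology = {"n1": "zone-a", "n2": "zone-a", "n3": "zone-b"}
print(max_skew(["n1", "n1", "n2", "n3"], topology))  # → 2 (3 pods vs 1)
```

A sustained high skew for a service's pods would be the symptom described above.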
/lgtm

/assign @wojtek-t
> * **Does enabling the feature change any default behavior?**
>
> Yes. Users might experience more spreading of Pods among Nodes and Zones in certain topology distributions.
What are the default spreading params? How do we group pods for spreading by default?
Also - is the new default spreading preferred or forced (i.e. predicate or priority)?
It's already documented in the KEP (#default-constraints), and it continues to be scoring (priority).
Got it - can you please cross-link here?
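For readers following along, the default constraints referenced above look roughly like the following `KubeSchedulerConfiguration` fragment. The `maxSkew` values and API version shown here reflect the KEP's #default-constraints section around the time of this review and may have changed since:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
  - pluginConfig:
      - name: PodTopologySpread
        args:
          # whenUnsatisfiable: ScheduleAnyway makes these scoring-only
          # (priority), so spreading is preferred, never forced.
          defaultConstraints:
            - maxSkew: 3
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: ScheduleAnyway
            - maxSkew: 5
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
```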
> For 100 nodes, with a 4-core master:
>
> - Latency for PreScore less than 15ms at the 99th percentile.
> - Latency for Score less than 50ms at the 99th percentile.
Sounds like quite a lot, isn't it?
I agree. We don't need to break them down, I guess, but the total overhead of scoring (pre and actual scoring) should be 15ms at the 99th percentile and 2ms at the 50th percentile.
I used the numbers from our current performance dashboards. I guess we can include 95th percentile as well. WDYT?
What are the 95th percentiles?
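One way to read these percentiles off a live cluster is a PromQL query over the scheduler framework's per-extension-point latency histogram. This is a sketch: it assumes the `scheduler_framework_extension_point_duration_seconds` metric exported by kube-scheduler in this release is being scraped.

```promql
histogram_quantile(0.95,
  sum by (le) (
    rate(scheduler_framework_extension_point_duration_seconds_bucket{extension_point="Score"}[5m])))
```

Swapping `0.95` for `0.99` or `0.5` gives the other percentiles discussed above.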
> Scheduling time on clusters with more than 100 nodes. Smaller clusters are unaffected.
> `SelectorSpreading` doesn't take into account all the Nodes in big clusters when calculating skew,
> resulting in partial spreading at this scale.
> On the contrary, `PodTopologySpreading` considers all nodes when using topologies bigger than a Node, like a Zone.
In synthetic unit-level benchmarks, the difference is negligible. But we will have more precise numbers from the dashboards when we enable the feature by default (beta).
Re benchmarks: even in 5k clusters? Are those using the same config we will have in real clusters?
The benchmark is actually a bit outdated, but its CPU footprint is the same. And the benchmark is for 1000 nodes.
https://github.com/kubernetes/kubernetes/blob/6e3ef0be163566c08398e1e5ff43f87accfb036b/pkg/scheduler/framework/plugins/podtopologyspread/scoring_test.go#L754
This is the legacy counterpart https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/selectorspread/selector_spread_perf_test.go
OK - that's fine.
Can you please add a note to ensure to validate it during graduation?
Something like:
"Before graduation we will ensure that the latency increase is acceptable with Scalability SIG"
(or sth along those lines).
> * **Will enabling / using this feature result in non-negligible increase of
>   resource usage (CPU, RAM, disk, IO, ...) in any components?**
>
> kube-scheduler needs to use more CPU to calculate Zone spreading.
Not sure what to use. Is wall time from synthetic benchmarks enough?
If we don't have anything else - that's at least something.
> * **Are there any tests for feature enablement/disablement?**
>
> There are unit tests in `pkg/scheduler/algorithmprovider/registry_test.go` that exercise the configuration of `kube-scheduler` with the plugins that correspond to the Feature Gate enablement.
They are useful, but these aren't purely enablement/disablement - they are checking the configuration.
The feature enablement/disablement leads to a different configuration.
Let me know if the new wording is more clear, or you are asking for other tests.
> * **What are other known failure modes?**
>
> - Pod scheduling is slow
>   - Detection: Pod creation latency is too high.
s/creation/startup time/ ?
Or scheduling?
startup time. That is the first symptom, and then you narrow down from there :)
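As a sketch of how that narrowing-down might start: if pod startup time is high, checking the scheduler's end-to-end scheduling latency isolates its share of the delay. This assumes the `scheduler_e2e_scheduling_duration_seconds` histogram present in kube-scheduler at the time of this review (it has since been superseded by newer scheduling-latency metrics):

```promql
histogram_quantile(0.99,
  sum by (le) (rate(scheduler_e2e_scheduling_duration_seconds_bucket[5m])))
```

If this stays low while startup latency is high, the bottleneck is likely outside the scheduler (e.g. image pulls or kubelet).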
ping @wojtek-t
wojtek-t left a comment
One last minor comment - other than that LGTM.
Force-pushed 03ae86e to 2d59bef
squashed

PRR is done #2011 (review)

/lgtm

[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ahg-g, alculquicondor

/hold cancel
/sig scheduling
Refs #1258
Includes: