KEP-3521: Add new metric for excluding PreEnqueue duration time#4065
Conversation
0514b12 to
61bfe65
Compare
| Describe the metrics themselves and the reasons why they weren't added (e.g., cost, | ||
| implementation difficulties, etc.). | ||
| --> | ||
| Yes, we will need a new metric `pod_scheduling_excluding_preenqueue_duration_seconds` to explicitly differentiate whether the time spent during `PreEnqueue` is included or excluded. |
There was a problem hiding this comment.
I would prefer we add a new metric that includes the time spent in preenqueue.
The existing metric shouldn't include preenqueue.
cc @ahg-g
There was a problem hiding this comment.
The concern might be the old metric is already stable, and users may already use it for SLO, etc... @Huang-Wei raised this here.
There was a problem hiding this comment.
Precisely. As a stable metric, it's behavior shouldn't change.
IMO, having it consider the PreEnqueue time is a breaking change.
There was a problem hiding this comment.
The existing metric shouldn't include preenqueue.
I don't think we should interpret the old metric this way. PreEnqueue falls into the category of scheduler extension point, so once some plugin implements it, it should by default assume its time counted as the e2e scheduling time, and by the new metric, they can opt-in to exclude this extension point explicitly.
There was a problem hiding this comment.
But the time a Pod spends in PreEnqueue is outside of the control of the scheduler.
The existing metric is meant to be used by SLIs, now we can't use it.
There was a problem hiding this comment.
I'm ok with deprecating the metric and introducing a new one pod_scheduling_sli_duration_seconds, after doing the backport.
The guidance says that stable metrics can be deprecated, but it doesn't say for how long they need to be served before deletion https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/metric-stability.md#stable
There was a problem hiding this comment.
The guidance says that stable metrics can be deprecated, but it doesn't say for how long they need to be served before deletion https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/metric-stability.md#stable
There was a problem hiding this comment.
Sounds we reach to a consensus. @helayoty to proceed we can work on the following tasks:
- this KEP will be tweaked a bit to introduce a new metric that is equivalent to what current
pod_scheduling_duration_secondsbehaves - include duration spent on all extension points - revisit your PR Unset gated podinfo InitialAttemptTimestamp in addToActiveQ kubernetes#118049 to ensure it's concise enough, as it needs to be back-ported to 1.26 and 1.27
- create a new, duplicate metric (with
pod_scheduling_duration_seconds) calledpod_scheduling_sli_duration_seconds, which will explicitly reveals it's a SLI-related metric - start the deprecation process for
pod_scheduling_duration_seconds
cc @ahg-g
There was a problem hiding this comment.
We don't usually fix bugs for alpha features. So the only requirement in the second point is to backport to 1.27, if it makes things easier.
There was a problem hiding this comment.
KEP has been updated with the new metric.
Signed-off-by: Heba Elayoty <hebaelayoty@gmail.com>
61bfe65 to
d7c3edb
Compare
Signed-off-by: Heba Elayoty <hebaelayoty@gmail.com>
|
/assign @ahg-g |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ahg-g, helayoty The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
One-line PR description: Add new metric for excluding
PreEnqueueduration timeIssue link: Pod Scheduling Readiness #3521
Other comments:
This requirement is part of the discussion related to excluding the PreEnquue duration time in case the pod is gated.