Skip to content

KEP-3521: Add new metric for excluding PreEnqueue duration time#4065

Merged
k8s-ci-robot merged 2 commits into
kubernetes:masterfrom
helayoty:scheduling-gates-3521
Jun 30, 2023
Merged

KEP-3521: Add new metric for excluding PreEnqueue duration time#4065
k8s-ci-robot merged 2 commits into
kubernetes:masterfrom
helayoty:scheduling-gates-3521

Conversation

@helayoty

@helayoty helayoty commented Jun 6, 2023

Copy link
Copy Markdown
Member
  • One-line PR description: Add new metric for excluding PreEnqueue duration time

  • Issue link: Pod Scheduling Readiness #3521

  • Other comments:
    This requirement is part of the discussion related to excluding the PreEnquue duration time in case the pod is gated.

@helayoty helayoty added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. labels Jun 6, 2023
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 6, 2023
@helayoty helayoty force-pushed the scheduling-gates-3521 branch from 0514b12 to 61bfe65 Compare June 6, 2023 19:09
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->
Yes, we will need a new metric `pod_scheduling_excluding_preenqueue_duration_seconds` to explicitly differentiate whether the time spent during `PreEnqueue` is included or excluded.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer we add a new metric that includes the time spent in preenqueue.

The existing metric shouldn't include preenqueue.

cc @ahg-g

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concern might be the old metric is already stable, and users may already use it for SLO, etc... @Huang-Wei raised this here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Precisely. As a stable metric, it's behavior shouldn't change.

IMO, having it consider the PreEnqueue time is a breaking change.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing metric shouldn't include preenqueue.

I don't think we should interpret the old metric this way. PreEnqueue falls into the category of scheduler extension point, so once some plugin implements it, it should by default assume its time counted as the e2e scheduling time, and by the new metric, they can opt-in to exclude this extension point explicitly.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the time a Pod spends in PreEnqueue is outside of the control of the scheduler.

The existing metric is meant to be used by SLIs, now we can't use it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with deprecating the metric and introducing a new one pod_scheduling_sli_duration_seconds, after doing the backport.

The guidance says that stable metrics can be deprecated, but it doesn't say for how long they need to be served before deletion https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/metric-stability.md#stable

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The guidance says that stable metrics can be deprecated, but it doesn't say for how long they need to be served before deletion https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/metric-stability.md#stable

4 releases or 16 months

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds we reach to a consensus. @helayoty to proceed we can work on the following tasks:

  • this KEP will be tweaked a bit to introduce a new metric that is equivalent to what current pod_scheduling_duration_seconds behaves - include duration spent on all extension points
  • revisit your PR Unset gated podinfo InitialAttemptTimestamp in addToActiveQ kubernetes#118049 to ensure it's concise enough, as it needs to be back-ported to 1.26 and 1.27
  • create a new, duplicate metric (with pod_scheduling_duration_seconds) called pod_scheduling_sli_duration_seconds, which will explicitly reveals it's a SLI-related metric
  • start the deprecation process for pod_scheduling_duration_seconds

cc @ahg-g

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't usually fix bugs for alpha features. So the only requirement in the second point is to backport to 1.27, if it makes things easier.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KEP has been updated with the new metric.

Signed-off-by: Heba Elayoty <hebaelayoty@gmail.com>
@helayoty helayoty force-pushed the scheduling-gates-3521 branch from 61bfe65 to d7c3edb Compare June 16, 2023 16:37
Comment thread keps/sig-scheduling/3521-pod-scheduling-readiness/README.md Outdated
Signed-off-by: Heba Elayoty <hebaelayoty@gmail.com>
@alculquicondor

Copy link
Copy Markdown
Member

/assign @ahg-g
/unassign

@ahg-g

ahg-g commented Jun 30, 2023

Copy link
Copy Markdown
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 30, 2023
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, helayoty

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 30, 2023
@k8s-ci-robot k8s-ci-robot merged commit ab8951b into kubernetes:master Jun 30, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone Jun 30, 2023
@helayoty helayoty deleted the scheduling-gates-3521 branch July 1, 2023 00:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants