Skip to content

WIP: kubelet: Prioritize certain pod status updates#107897

Closed
smarterclayton wants to merge 3 commits into
kubernetes:masterfrom
smarterclayton:pod_status_latency
Closed

WIP: kubelet: Prioritize certain pod status updates#107897
smarterclayton wants to merge 3 commits into
kubernetes:masterfrom
smarterclayton:pod_status_latency

Conversation

@smarterclayton

@smarterclayton smarterclayton commented Feb 1, 2022

Copy link
Copy Markdown
Contributor

Some pod status transitions directly impact end-to-end user latency in the Kubelet, such as pods going ready, going unready, or becoming Succeeded or Failed. Prioritize the order that pods are updated in to minimize that latency.

To make this easier, the status manager reporting mechanism is streamlined to track the set of updated pods instead of using a buffered channel. Remove the time the pod status lock is held by moving other expensive checks out of the loop, which also opens the door for parallelizing the status queue later. Avoid making some checks twice now that syncPod is only called from syncBatch. Protect apiStatusVersions under the pod status lock as well to prevent accidents.

This should prevent head of line blocking where lots of status updates build up in the queue as seen in flakes like:

but have not confirmed that these are HOL.

Builds on #107896

/kind bug

Pod status updates that change the readiness status of a pod, or indicate the pod has succeeded or failed, are prioritized by the Kubelet to reduce end to end latency of some actions.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 1, 2022
Comment thread pkg/kubelet/status/status_manager.go Outdated

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This construct is no longer necessary in the new code, because syncPod no longer calls needUpdate to check this value and we directly invoke syncPod only if needed.

Comment thread pkg/kubelet/status/status_manager.go Outdated

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value was previously always calculated, but could be invoked twice. Calculate it only once per pod.

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/kubelet sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/node Categorizes an issue or PR as relevant to SIG Node. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 1, 2022
Comment thread pkg/kubelet/status/status_manager.go Outdated

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performed under the lock in case we decide to parallelize the syncPod method.

Comment thread pkg/kubelet/status/status_manager.go Outdated

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment was duplicated and is calculated above now.

Comment thread pkg/kubelet/status/status_manager.go Outdated

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method holds the read lock so it can check apiStatusVersions in case we parallelize this method.

@smarterclayton

Copy link
Copy Markdown
Contributor Author

This is the simplest possible prioritization but i want to contrast a few different strategies before this would be a serious candidate.

Track how long it takes for pod updates to propagate from detection
to successful change on API server. Will guide future improvements
in pod start and shutdown latency.
Streamline the pod status manager to track the set of updated pods
instead of using a buffered channel. Remove the time the pod status
lock is held by moving other expensive checks out of the loop, which
also opens the door for parallelizing the status queue later. Avoid
making some checks twice now that syncPod is only called from
syncBatch. Protect apiStatusVersions under the pod status lock as
well to prevent accidents.
Some pod status transitions directly impact end-to-end user latency
in the Kubelet, such as pods going ready, going unready, or becoming
Succeeded or Failed.

Prioritize the order that pods are updated in to minimize that
latency.
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

@smarterclayton: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-verify-govet-levee e220a4c link true /test pull-kubernetes-verify-govet-levee
pull-kubernetes-typecheck e220a4c link true /test pull-kubernetes-typecheck
pull-kubernetes-unit e220a4c link true /test pull-kubernetes-unit
pull-kubernetes-verify e220a4c link true /test pull-kubernetes-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@249043822

Copy link
Copy Markdown
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 17, 2022
@dashpole

Copy link
Copy Markdown
Contributor

/assign @logicalhan

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 15, 2022
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

@smarterclayton: PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-triage-robot

Copy link
Copy Markdown

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 15, 2022
@k8s-triage-robot

Copy link
Copy Markdown

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 15, 2022
@k8s-triage-robot

Copy link
Copy Markdown

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

@k8s-triage-robot: Closed this PR.

Details

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Projects

Development

Successfully merging this pull request may close these issues.

8 participants