WIP: kubelet: Prioritize certain pod status updates by smarterclayton · Pull Request #107897 · kubernetes/kubernetes

smarterclayton · 2022-02-01T05:19:08Z

Some pod status transitions directly impact end-to-end user latency in the Kubelet, such as pods going ready, going unready, or becoming Succeeded or Failed. Prioritize the order that pods are updated in to minimize that latency.
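As a rough illustration of the idea (not the code in this PR), prioritization can be as simple as a stable sort that moves readiness changes and terminal phases to the front of the pending updates. The `statusUpdate` type, `PodPhase` constants, and the `isLatencySensitive` and `prioritize` helpers below are assumptions made for this sketch, not Kubelet identifiers.

```go
package main

import (
	"fmt"
	"sort"
)

// PodPhase and its constants are simplified stand-ins for the Kubernetes
// API types; they are assumptions made for this sketch.
type PodPhase string

const (
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// statusUpdate is a hypothetical record of one pending status change.
type statusUpdate struct {
	PodUID   string
	OldReady bool
	NewReady bool
	NewPhase PodPhase
}

// isLatencySensitive reports whether the update changes readiness or moves
// the pod to a terminal phase (Succeeded or Failed).
func isLatencySensitive(u statusUpdate) bool {
	if u.OldReady != u.NewReady {
		return true
	}
	return u.NewPhase == PodSucceeded || u.NewPhase == PodFailed
}

// prioritize reorders updates so latency-sensitive ones come first.
// The sort is stable, so updates keep their relative order within each group.
func prioritize(updates []statusUpdate) {
	sort.SliceStable(updates, func(i, j int) bool {
		return isLatencySensitive(updates[i]) && !isLatencySensitive(updates[j])
	})
}

func main() {
	updates := []statusUpdate{
		{PodUID: "a", OldReady: true, NewReady: true, NewPhase: PodRunning},
		{PodUID: "b", OldReady: false, NewReady: true, NewPhase: PodRunning},
		{PodUID: "c", OldReady: false, NewReady: false, NewPhase: PodFailed},
	}
	prioritize(updates)
	for _, u := range updates {
		fmt.Println(u.PodUID) // prints b, c, a on separate lines
	}
}
```

A two-level ordering like this is only one possible strategy; finer-grained priorities or aging to avoid starving low-priority updates would need a different comparison.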

To make this easier, the status manager's reporting mechanism is streamlined to track the set of updated pods instead of using a buffered channel. Reduce the time the pod status lock is held by moving other expensive checks out of the loop, which also opens the door to parallelizing the status queue later. Avoid making some checks twice now that syncPod is only called from syncBatch. Protect apiStatusVersions under the pod status lock as well to prevent accidents.
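A minimal sketch of the "set of updated pods" approach, assuming a hypothetical `pendingSet` type (none of these names come from the PR): repeated updates for the same pod coalesce into one map entry, a one-slot channel serves only as a wake-up signal, and the lock is held just long enough to swap the map.

```go
package main

import (
	"fmt"
	"sync"
)

// pendingSet is a hypothetical structure for this sketch. It coalesces
// repeated updates for the same pod and uses a one-slot channel only as a
// wake-up signal, so producers never block on a full queue.
type pendingSet struct {
	mu     sync.Mutex
	pods   map[string]struct{}
	notify chan struct{}
}

func newPendingSet() *pendingSet {
	return &pendingSet{
		pods:   make(map[string]struct{}),
		notify: make(chan struct{}, 1),
	}
}

// Mark records that a pod needs a status sync. It never blocks: if a
// wake-up is already queued, the extra signal is dropped.
func (p *pendingSet) Mark(uid string) {
	p.mu.Lock()
	p.pods[uid] = struct{}{}
	p.mu.Unlock()

	select {
	case p.notify <- struct{}{}:
	default:
	}
}

// Drain returns the pods that need syncing and resets the set. The lock is
// held only while swapping the map, so slow work can happen outside it.
func (p *pendingSet) Drain() []string {
	p.mu.Lock()
	pods := p.pods
	p.pods = make(map[string]struct{})
	p.mu.Unlock()

	uids := make([]string, 0, len(pods))
	for uid := range pods {
		uids = append(uids, uid)
	}
	return uids
}

func main() {
	ps := newPendingSet()
	ps.Mark("pod-a")
	ps.Mark("pod-a") // coalesced with the previous mark
	ps.Mark("pod-b")

	<-ps.notify                  // a single wake-up covers all three marks
	fmt.Println(len(ps.Drain())) // prints 2
}
```

Because the drained slice is a snapshot, a consumer is free to sort it by priority or hand it to several workers without holding the lock.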

This should prevent head-of-line blocking, where many status updates build up in the queue, as seen in flakes like:

but I have not confirmed that these flakes are caused by head-of-line (HOL) blocking.

Builds on #107896

/kind bug

Pod status updates that change the readiness status of a pod, or indicate the pod has succeeded or failed, are prioritized by the Kubelet to reduce the end-to-end latency of some actions.

smarterclayton · 2022-02-01T05:20:58Z

This construct is no longer necessary in the new code, because syncPod no longer calls needUpdate to check this value and we directly invoke syncPod only if needed.

smarterclayton · 2022-02-01T05:22:27Z

This value was previously always calculated, and could be calculated twice. Calculate it only once per pod.

k8s-ci-robot · 2022-02-01T05:22:48Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~pkg/kubelet/OWNERS~~ [smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

smarterclayton · 2022-02-01T05:23:03Z

Performed under the lock in case we decide to parallelize the syncPod method.

smarterclayton · 2022-02-01T05:23:53Z

This comment was duplicated and is calculated above now.

smarterclayton · 2022-02-01T05:24:19Z

This method holds the read lock so it can check apiStatusVersions in case we parallelize this method.
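The locking pattern described here can be sketched as follows, assuming a hypothetical `manager` type guarded by a `sync.RWMutex`. Only `apiStatusVersions` is a name taken from the discussion; `latestVersions`, `needsUpdate`, and `recordSynced` are invented for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical, trimmed-down stand-in for a status manager.
type manager struct {
	mu                sync.RWMutex
	latestVersions    map[string]uint64 // newest locally recorded version per pod (assumed field)
	apiStatusVersions map[string]uint64 // last version written to the API server
}

// needsUpdate takes only the read lock, so several callers can compare
// versions concurrently without blocking one another.
func (m *manager) needsUpdate(uid string) bool {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.apiStatusVersions[uid] < m.latestVersions[uid]
}

// recordSynced takes the write lock because it mutates the map.
func (m *manager) recordSynced(uid string, version uint64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.apiStatusVersions[uid] = version
}

func main() {
	m := &manager{
		latestVersions:    map[string]uint64{"pod-a": 3},
		apiStatusVersions: map[string]uint64{"pod-a": 2},
	}
	fmt.Println(m.needsUpdate("pod-a")) // prints true
	m.recordSynced("pod-a", 3)
	fmt.Println(m.needsUpdate("pod-a")) // prints false
}
```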

smarterclayton · 2022-02-01T05:46:57Z

This is the simplest possible prioritization, but I want to contrast a few different strategies before this would be a serious candidate.

Track how long it takes for pod updates to propagate from detection to successful change on API server. Will guide future improvements in pod start and shutdown latency.

Streamline the pod status manager to track the set of updated pods instead of using a buffered channel. Reduce the time the pod status lock is held by moving other expensive checks out of the loop, which also opens the door to parallelizing the status queue later. Avoid making some checks twice now that syncPod is only called from syncBatch. Protect apiStatusVersions under the pod status lock as well to prevent accidents.

Some pod status transitions directly impact end-to-end user latency in the Kubelet, such as pods going ready, going unready, or becoming Succeeded or Failed. Prioritize the order that pods are updated in to minimize that latency.

k8s-ci-robot · 2022-02-03T04:13:17Z

@smarterclayton: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kubernetes-verify-govet-levee | `e220a4c` | link | true | `/test pull-kubernetes-verify-govet-levee` |
| pull-kubernetes-typecheck | `e220a4c` | link | true | `/test pull-kubernetes-typecheck` |
| pull-kubernetes-unit | `e220a4c` | link | true | `/test pull-kubernetes-unit` |
| pull-kubernetes-verify | `e220a4c` | link | true | `/test pull-kubernetes-verify` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

249043822 · 2022-02-17T08:54:58Z

/triage accepted

dashpole · 2022-03-10T17:43:40Z

/assign @logicalhan

k8s-ci-robot · 2022-03-17T17:35:15Z

@smarterclayton: PR needs rebase.


k8s-triage-robot · 2022-06-15T18:22:09Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2022-07-15T18:58:10Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2022-08-14T19:42:05Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue or PR with /reopen
Mark this issue or PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot · 2022-08-14T19:42:23Z

@k8s-triage-robot: Closed this PR.



k8s-ci-robot requested review from SergeyKanzhelev and sjenning February 1, 2022 05:22


smarterclayton force-pushed the pod_status_latency branch from 8e4d42d to 4cb4ed6 Compare February 1, 2022 05:25

smarterclayton mentioned this pull request Feb 1, 2022

Long running actions (CRI, GC, Stats) in Kubelet should pass context.Context #107829

Closed

smarterclayton force-pushed the pod_status_latency branch 3 times, most recently from 6ff507d to f20ab45 Compare February 2, 2022 19:04

smarterclayton added 3 commits February 2, 2022 22:18

kubelet: Record a metric for latency of pod status update

7e3c31a

Track how long it takes for pod updates to propagate from detection to successful change on API server. Will guide future improvements in pod start and shutdown latency.

kubelet: Prioritize certain pod status updates

e220a4c

Some pod status transitions directly impact end-to-end user latency in the Kubelet, such as pods going ready, going unready, or becoming Succeeded or Failed. Prioritize the order that pods are updated in to minimize that latency.

smarterclayton force-pushed the pod_status_latency branch from f20ab45 to e220a4c Compare February 3, 2022 03:19

k8s-ci-robot added triage/accepted and removed needs-triage labels Feb 17, 2022

smarterclayton mentioned this pull request Mar 1, 2022

Prevent pods from defaulting to zero second grace periods #102025

Closed

k8s-ci-robot assigned logicalhan Mar 10, 2022

k8s-ci-robot added needs-rebase and removed needs-rebase labels Mar 15, 2022

k8s-ci-robot added the lifecycle/stale label Jun 15, 2022

k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Jul 15, 2022

k8s-ci-robot closed this Aug 14, 2022

smarterclayton mentioned this pull request Feb 9, 2023

kubelet: Force deleted pods can fail to move out of terminating #113145

Merged

5 tasks

This was referenced Mar 14, 2023

kubelet: Remove status manager channel #116615

Closed

Pod status updates take longer to propagate to the API than necessary #116617

Closed
