kubelet: Record a metric for latency of pod status update by smarterclayton · Pull Request #107896 · kubernetes/kubernetes

smarterclayton · 2022-02-01T05:10:09Z

Track how long it takes for a pod status update to propagate from detection to a successful write on the API server. This will guide future improvements in pod start and shutdown latency. Pod status updates have been observed to take 30s or more to reach the apiserver even on underutilized nodes; this metric quantifies status sync latency and helps identify where improvements are needed.

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

A new `pod_status_sync_duration_seconds` histogram is reported at alpha metrics stability that estimates how long the Kubelet takes to write a pod status change once it is detected.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

binacs

/retest

fedebongio · 2022-02-03T17:34:44Z

/remove-sig api-machinery

ehashman · 2022-02-10T17:43:38Z

/assign @dashpole
for instrumentation

/triage accepted
/priority backlog

/remove-kind api-change bug
/kind feature

dashpole

lgtm overall.

dashpole · 2022-02-10T19:39:44Z

+	} else {
+		duration = time.Now().Sub(status.at).Truncate(time.Millisecond)
+	}
+	metrics.PodStatusSyncDuration.WithLabelValues(strconv.Itoa(0)).Observe(duration.Seconds())


Should we only observe a duration if status.at was non-zero?

Good question, it's typically an indicator that something unexpected happened, but it would skew the numbers. Let me review.

I'll only observe if it's set, and then verify there are no weird cases where a synthetic status is created that somehow escapes time (can't be 100% sure, but log observation in runs should be sufficient).

dashpole · 2022-02-10T19:45:28Z

 		podNamespace: pod.Namespace,
 	}
+
+	if cachedStatus.at.IsZero() {


This happens only when we don't have a cached status in podStatuses, right?

Correct, which I'm not actually sure can happen except during some sort of invalidation. Will look though.

Also during Kubelet restarts we will have a clean cache in status manager until every pod worker has started up and invoked SetPodStatus at least once.

dashpole · 2022-02-10T19:46:59Z

+	if cachedStatus.at.IsZero() {
+		newStatus.at = time.Now()
+	} else {
+		newStatus.at = cachedStatus.at


This essentially ensures we measure the longest outstanding status update for a pod, right? E.g. if multiple status updates end up batched, we want to measure how long it took from the first update.

Can you add a brief comment?

dashpole · 2022-02-10T19:49:12Z

+			Buckets:        []float64{0.010, 0.050, 0.100, 0.500, 1, 5, 10, 20, 30, 45, 60},
+			StabilityLevel: metrics.ALPHA,
+		},
+		[]string{"priority"},


I assume you have a good reason for wanting to break down by priority :). Can you document that in the PR description?

Priority was a follow-on concept I was testing (some status updates, like a transition from unready to ready or from running to failed, are "more important" than updates like a container message being propagated).

I'll remove it from this PR and introduce it as a separate commit in a subsequent PR.

k8s-triage-robot · 2022-06-15T18:22:10Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

smarterclayton · 2022-06-16T18:11:21Z

/remove-lifecycle stale

ritazh · 2022-07-25T16:10:23Z

/remove-sig auth
Feel free to add us back if needed.

linux-foundation-easycla · 2022-08-15T17:19:32Z

The committers listed above are authorized under a signed CLA.

✅ login: smarterclayton / name: Clayton Coleman (e7710053ec514abf73294055145668fdcb1e4cdf)

k8s-ci-robot · 2022-08-15T21:04:24Z

@smarterclayton: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kubernetes-e2e-gce-network-proxy-grpc | 5cf4fb768d206270f04478202cad7f3661cbc412 | link | false | `/test pull-kubernetes-e2e-gce-network-proxy-grpc` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Track how long it takes for pod updates to propagate from detection to successful change on API server. Will guide future improvements in pod start and shutdown latency. Metric is `kubelet_pod_status_sync_duration_seconds` and is ALPHA stability. Histogram buckets are chosen based on distribution of observed status delays in practice.

smarterclayton · 2022-09-09T17:31:55Z

Ok, I think this is ready for re-review with all comments addressed, PTAL when you have the chance.

dashpole

/lgtm

k8s-ci-robot · 2022-09-27T20:19:23Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BinacsLee, dashpole, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/kubelet/OWNERS~~ [smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot requested review from dchen1107 and mtaufen February 1, 2022 05:11

smarterclayton force-pushed the track_pod_sync_latency branch from e4c2b6c to ad00f6e February 1, 2022 05:15

smarterclayton mentioned this pull request Feb 1, 2022

WIP: kubelet: Prioritize certain pod status updates #107897

Closed

smarterclayton force-pushed the track_pod_sync_latency branch 2 times, most recently from 2b89a36 to 5cf4fb7 February 1, 2022 23:15

smarterclayton force-pushed the track_pod_sync_latency branch 4 times, most recently from e4d7d0a to 7e3c31a February 3, 2022 03:18

binacs approved these changes Feb 3, 2022

k8s-ci-robot removed the sig/api-machinery label Feb 3, 2022

k8s-ci-robot assigned dashpole Feb 10, 2022

dashpole reviewed Feb 10, 2022

k8s-ci-robot added and removed the needs-rebase label Mar 15, 2022

bobbypage mentioned this pull request Sep 8, 2022

[Node Graceful Shutdown] kubelet sometimes doesn't finish kill pods before node shutdown within 30s and left the pods in running state still #110755

Open

dashpole approved these changes Sep 27, 2022

11 participants
