Skip to content

kubelet: Record a metric for latency of pod status update#107896

Merged
k8s-ci-robot merged 1 commit into
kubernetes:masterfrom
smarterclayton:track_pod_sync_latency
Sep 27, 2022
Merged

kubelet: Record a metric for latency of pod status update#107896
k8s-ci-robot merged 1 commit into
kubernetes:masterfrom
smarterclayton:track_pod_sync_latency

Conversation

@smarterclayton

@smarterclayton smarterclayton commented Feb 1, 2022

Copy link
Copy Markdown
Contributor

Track how long it takes for pod updates to propagate from detection to successful change on API server. Will guide future improvements in pod start and shutdown latency. Currently pod status updates have been observed to take 30s or more to propagate to the apiserver even on underutilized nodes, and this metric will help quantify the level of latency for status syncs and help identify the need for improvements.

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

A new `pod_status_sync_duration_seconds` histogram is reported at alpha metrics stability that estimates how long the Kubelet takes to write a pod status change once it is detected.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 1, 2022
@smarterclayton smarterclayton force-pushed the track_pod_sync_latency branch 2 times, most recently from 2b89a36 to 5cf4fb7 Compare February 1, 2022 23:15
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. do-not-merge/contains-merge-commits Indicates a PR which contains merge commits. area/apiserver area/cloudprovider area/code-generation area/conformance Issues or PRs related to kubernetes conformance tests area/dependency Issues or PRs related to dependency changes area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/kubeadm and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 1, 2022
@smarterclayton smarterclayton force-pushed the track_pod_sync_latency branch 4 times, most recently from e4d7d0a to 7e3c31a Compare February 3, 2022 03:18

@binacs binacs left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/retest

@fedebongio

Copy link
Copy Markdown
Contributor

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Feb 3, 2022
@ehashman

Copy link
Copy Markdown
Member

/assign @dashpole
for instrumentation

/triage accepted
/priority backlog

/remove-kind api-change bug
/kind feature

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/backlog Higher priority than priority/awaiting-more-evidence. kind/feature Categorizes issue or PR as related to a new feature. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/bug Categorizes issue or PR as related to a bug. labels Feb 10, 2022

@dashpole dashpole left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm overall.

Comment thread pkg/kubelet/status/status_manager.go Outdated
} else {
duration = time.Now().Sub(status.at).Truncate(time.Millisecond)
}
metrics.PodStatusSyncDuration.WithLabelValues(strconv.Itoa(0)).Observe(duration.Seconds())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we only observe a duration if status.at was non-zero?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, it's typically an indicator that something unexpected happened, but it would skew the numbers. Let me review.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll only observe if it's set, and then verify there are no wierd cases where a synthetic status is created that somehow escapes time (can't be 100% sure, but log observation in runs should be sufficient).

podNamespace: pod.Namespace,
}

if cachedStatus.at.IsZero() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens only when when we don't have a cached status in podStatuses, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, which I'm not actually sure can happen except during some sort of invalidation. Will look though.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also during Kubelet restarts we will have a clean cache in status manager until every pod worker has started up and invoked SetPodStatus at least once.

if cachedStatus.at.IsZero() {
newStatus.at = time.Now()
} else {
newStatus.at = cachedStatus.at

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This essentially ensures we measure the longest outstanding status update for a pod, right? E.g. if multiple status updates end up batched, we want to measure how long it took from the first update.

Can you add a brief comment?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

Comment thread pkg/kubelet/metrics/metrics.go Outdated
Buckets: []float64{0.010, 0.050, 0.100, 0.500, 1, 5, 10, 20, 30, 45, 60},
StabilityLevel: metrics.ALPHA,
},
[]string{"priority"},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you have a good reason for wanting to breakdown by priority :). Can you document that in the PR description?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Priority was a follow on concept I was testing (some status updates like transition from unready to ready or running to failed are "more important" than updates like a container message being propagated).

I'll remove it from this PR and introduce it as a separate commit in a subsequent PR.

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 15, 2022
@k8s-triage-robot

Copy link
Copy Markdown

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@smarterclayton

Copy link
Copy Markdown
Contributor Author

/remove-lifecycle stale

@ritazh

ritazh commented Jul 25, 2022

Copy link
Copy Markdown
Member

/remove-sig auth
Feel free to add us back if needed.

@linux-foundation-easycla

linux-foundation-easycla Bot commented Aug 15, 2022

Copy link
Copy Markdown

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: smarterclayton / name: Clayton Coleman (e7710053ec514abf73294055145668fdcb1e4cdf)

@k8s-ci-robot

k8s-ci-robot commented Aug 15, 2022

Copy link
Copy Markdown
Contributor

@smarterclayton: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-e2e-gce-network-proxy-grpc 5cf4fb768d206270f04478202cad7f3661cbc412 link false /test pull-kubernetes-e2e-gce-network-proxy-grpc

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Track how long it takes for pod updates to propagate from detection
to successful change on API server. Will guide future improvements
in pod start and shutdown latency.

Metric is `kubelet_pod_status_sync_duration_seconds` and is ALPHA
stability. Histogram buckets are chosen based on distribution of
observed status delays in practice.
@smarterclayton

Copy link
Copy Markdown
Contributor Author

Ok, I think this is ready for re-review with all comments addressed, PTAL when you have the chance.

@dashpole dashpole left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BinacsLee, dashpole, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver area/cloudprovider area/code-generation area/conformance Issues or PRs related to kubernetes conformance tests area/dependency Issues or PRs related to dependency changes area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/kubectl area/kubelet area/provider/gcp Issues or PRs related to gcp provider area/release-eng Issues or PRs related to the Release Engineering subproject area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/backlog Higher priority than priority/awaiting-more-evidence. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/release Categorizes an issue or PR as relevant to SIG Release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on. wg/structured-logging Categorizes an issue or PR as relevant to WG Structured Logging.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.