client-go: do not print the request latency logs at level 0 by default (#101634)

SataQiu wants to merge 1 commit into kubernetes:master
Conversation
/priority important-soon

/sig instrumentation

/cc @logicalhan
Force-pushed 416d8ec to f7bd795.

/test pull-kubernetes-e2e-gce-ubuntu-containerd
logicalhan left a comment
I'm confused by this PR. From what I can tell you are changing a throttling setting, which shouldn't affect which level request latency logs are printed at. This change, IIUC, should just mean that instead of throttling at logs of V(0) we are throttling V(1) logs.
@logicalhan This change doesn't look like it's changing the throttling setting, but rather the log level at which the throttling message is printed. And it's true, it was last changed in PR #88134 that @SataQiu linked (before that, the message was logged at a higher verbosity level). Since I updated my client I see this message when querying any resource. Hope this explanation helps. I give this a +1 to merge, because this info is not useful for most API clients, and for the ones that need a bit more verbosity it will still show up when verbosity is set to 1.
/triage accept
@caesarxuchao: The label(s) `triage/accept` cannot be applied. In response to this: /triage accept

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/triage accepted
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: logicalhan, SataQiu

The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
/assign @liggitt
/unassign

Looks like #88134 cited #87740 (comment) as the reason for the higher log level. It prints to stderr, right? There are many things that can print to stderr; I would not expect structured output to be passed via that stream.
@liggitt Maybe V(1) and V(2) would be appropriate log levels (instead of V(0) and V(2)).
cc @deads2k
TBH, the messages have been useful to us at their current level, because component authors find them and wonder why they see them. I'm biased towards keeping delays of over one second as info-level messages. @SataQiu do you have examples of healthy components (no hot or warm loops) with properly configured client-side rate limits seeing this message unnecessarily? It seems like these messages at an info level would be helpful, not harmful.
Emm... /close |
@SataQiu: Closed this PR. In response to this: /close
My guess is that most people don't know they are using client-side rate limits, and on average printing this log message saves 10 hours of debugging for every hour it costs someone trying to hide it.
@lavalamp since everyone finds these messages helpful, maybe I'm just interpreting them wrong, and rather than wanting to hide them I should figure out what causes them. Edit: I know that our clusters have many CRDs (mainly through cloud-vendor provisioners like Google Config Connector), and I know it has to do with the number of CRDs and the discovery cache that kubectl keeps around. But I don't know how to improve the situation and how to not constantly run into rate limiting...
@jonnylangefeld In code you can modify the rest.Config struct fields QPS and Burst. -1 disables client-side rate limiting, 0 chooses a default (50 qps, IIRC). Now that APF is defaulted on, I think most users can just disable client-side limits and let the server worry about it (the exception is if there are many watchers of the resources you are modifying, e.g. Endpoints). We were going to add these to kubeconfig but didn't, because they're only applicable to the standard Go client: kubernetes/enhancements#1630. Some binaries (kube-controller-manager) might expose command-line flags for these.
Thanks for the explanation! Hmm, so for kubectl that would mean we maintain our own fork and build it ourselves?
How are you hitting this limit with kubectl?
It seems like it has to do with the discovery cache. If I run kubectl with verbose request logging, I can see that it's making requests to all the custom resources we have installed on the cluster. I think the many GET requests trigger the client-side throttling.
OK, that was going to be my guess. How many CRDs do you have? You should need over 100 group versions to trigger this.
Yeah, we have a bunch. As mentioned in my comment above, it's over 200 individual CRD resources. And since you said only distinct group versions matter, I counted those with a jq query: not quite 100 group versions, but close. Now the question would be: what options do we have to either extend the validity of the cache (so kubectl can rely on it longer) or, maybe even better, only rebuild the cache when I'm querying for a resource that's not already in the discovery cache?
That jq query is undercounting (it needs to emit a name for every version, not just the first). I have filed an issue about this, since no one is going to see it buried here :) #105489. Thanks for the report.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Printing the latency logs at verbosity 0 by default can break users' existing workflows, especially if they are doing something with the command output. IMO, we should still print this message, but at a higher verbosity level rather than the default level of 0.
This PR changes the level at which the throttledLogger prints request latency messages, so they are no longer logged at level 0 by default.

Which issue(s) this PR fixes:
Fixes #101631
Ref #88134 (before that PR, the log was printed at level 2)
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: