
Add a retroactive one-pager for rate limiting#5415

Merged
negz merged 1 commit into crossplane:master from negz:the-going-rate
Feb 25, 2024

Conversation

@negz
Member

@negz negz commented Feb 21, 2024

Description of your changes

I spent a bunch of time this evening trying to remember how Crossplane's rate limiting works, and why it works that way. I figured I'd write it all down for next time.


Signed-off-by: Nic Cope <nicc@rk0n.org>
@negz negz requested a review from turkenh February 21, 2024 06:02
@negz negz requested a review from a team as a code owner February 21, 2024 06:02
@negz negz requested a review from phisco February 21, 2024 06:02
@negz negz force-pushed the the-going-rate branch 2 times, most recently from cab334e to 9615979, February 21, 2024 06:08
@negz
Member Author

negz commented Feb 21, 2024

@turkenh two thoughts come to mind after writing this up.

First, there's not really a need for --max-reconcile-rate in core Crossplane. In theory I like the simplicity of the --max-reconcile-rate flag and the symmetry with providers, but unlike a provider there's no reason to globally limit all Crossplane controllers (e.g. all claims, XRs, packages, etc.) to achieve a maximum reconcile rate across all of Crossplane, given it's only talking to the API server. At a minimum, we can probably make Crossplane's default --max-reconcile-rate a lot higher (e.g. 100).

Second, I'm not sure why the global rate limiter would affect the workqueue_depth metric as you saw in crossplane-contrib/provider-kubernetes#203. It should be re-adding anything that's rate limited back to the same work queue and thus maintaining the same depth. I would expect it to affect workqueue_duration.
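One way to see this point concretely is a toy accounting model (hypothetical, not the actual workqueue metrics code): each rate-limited pass pops and re-adds the item, so depth ends up where it started, while the number of passes, and hence the item's time in the queue, grows.

```go
package main

import "fmt"

func main() {
	// Toy accounting for a work queue where a rate-limited item is
	// popped and immediately re-added (what a RequeueAfter does).
	depth := 5    // items in the queue
	attempts := 0 // reconcile passes for one item

	// Simulate one item being rate limited twice before succeeding.
	for limited := 2; ; limited-- {
		depth-- // item popped from the queue
		attempts++
		if limited == 0 {
			break // tokens available: reconciled for real
		}
		depth++ // rate limited: re-added, depth is back where it was
	}

	fmt.Println(depth)    // 4: one item left the queue for good
	fmt.Println(attempts) // 3: but it took three passes (longer duration)
}
```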

> the middleware `Reconciler`, subject to its token bucket reconciler. If there
> are sufficient tokens available in the bucket, the reconcile is passed to the
> wrapped (inner) `Reconciler` immediately. If there aren't sufficient tokens
> available, the reconcile is returned to the tail of the work queue by returning
Member

@turkenh turkenh Feb 23, 2024


Wouldn't it be more fair if we put it back to the head of the queue? Why do we change the ordering in the queue?
Maybe I misunderstood something but, as an analogy, I am waiting in a bank queue and when it is my turn I am told by the officer to go to the tail of the queue because he processed enough for that given hour 😅

Member Author


> Wouldn't it be more fair if we put it back to the head of the queue?

It would.

> Why do we change the ordering in the queue?

No reason, really. 🙂 It just seems to be how all the Kubernetes work queue rate limiting works, and the topic didn't come up until now.

The queueing is all using https://pkg.go.dev/k8s.io/client-go/util/workqueue. When we return RequeueAfter I believe that translates to an AddAfter, i.e. "add back to the end of the queue after duration D":

https://github.com/crossplane/crossplane-runtime/blob/v1.15.1/pkg/ratelimiter/reconciler.go#L51

We could potentially improve on this by using a RateLimitingQueue inside our ratelimiter.Reconciler. This way there would be two levels of workqueue:

  • Outer queue handles per-object exponential backoff. Requests sent to back of outer queue.
  • Inner queue handles global max-reconcile-rate. Requests sent to back of inner queue.

I think if we wanted to support requests being sent to the front of the queue after a certain duration we'd need to implement our own work queue.
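To make the ordering point concrete, here's a toy FIFO (hypothetical, far simpler than client-go's workqueue, which also deduplicates and handles delays) showing why an AddAfter-style requeue puts a rate-limited request behind requests that arrived after it:

```go
package main

import "fmt"

// queue is a toy FIFO. client-go's workqueue is more sophisticated,
// but a re-added item still lands at the tail.
type queue struct{ items []string }

func (q *queue) Add(item string) { q.items = append(q.items, item) }

func (q *queue) Get() string {
	item := q.items[0]
	q.items = q.items[1:]
	return item
}

func main() {
	q := &queue{}
	q.Add("first") // first in line
	q.Add("second")

	// "first" is popped but rate limited, so it's re-added
	// (AddAfter-style) and now waits behind "second".
	q.Add(q.Get())

	fmt.Println(q.items) // [second first]
}
```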

There's talk of allowing users to supply their own work queue (not only their own rate limiter) in controller-runtime in kubernetes-sigs/controller-runtime#857. If we want to optimize this as much as possible, it might be worth pursuing that upstream.

@negz
Member Author

negz commented Feb 23, 2024

@turkenh Given that this is just documenting how things work today, WDYT about approving so I can merge? We can then discuss how it could be better.

Member

@turkenh turkenh left a comment


Thanks for taking the time to write this down 🙌
