Skip to content

Conversation

@jprenken
Copy link
Contributor

@jprenken jprenken commented Sep 24, 2025

Add a refreshOverrides func to the ratelimits.limitRegistry struct. Instead of populating the static overrides field once when creating an instance of the struct, call the new func at startup and then every 30 minutes.

Emit relevant logs and metrics from limitRegistry.

Add an OverridesFromDB limiter config flag (for RA & WFE) to read overrides from the DB instead of a file.

Flatten newLimitRegistry.*() methods' logic into their sole caller, NewTransactionBuilder().

Rename loadDefaults() & loadOverrides(), appending FromFile for clarity/consistency.

test: Add ra-sct-provider dependency on SA.

Important for deployment: If the OverridesFromDB config flag is enabled, an RA now depends on the SA in order to load overrides. The RA must be added as a gRPC client of sa.StorageAuthorityReadOnly.

CPS Compliance Review: OverridesFromDB only controls how we load rate limit overrides, which has no compliance implications beyond general API availability (e.g. for revocation). I've checked our CP/CPS to confirm we make no related stipulations.

Fixes #8382

@github-actions
Copy link
Contributor

@jprenken, this PR appears to contain configuration and/or SQL schema changes. Please ensure that a corresponding deployment ticket has been filed with the new values.

@github-actions
Copy link
Contributor

@jprenken, this PR adds one or more new feature flags: OverridesFromDB. As such, this PR must be accompanied by a review of the Let's Encrypt CP/CPS to ensure that our behavior both before and after this flag is flipped is compliant with that document.

Please conduct such a review, then add your findings to the PR description in a paragraph beginning with "CPS Compliance Review:".

Copy link
Contributor

@jsha jsha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some initial style/organizational comments, but I also have a high-level concern. We've talked before about how blocking the process on a database query might become an issue, for instance if we are restarting a datacenter and the RA comes up before the database. It would be nice to not have an ordering dependency on startup; everything can just come up whenever and eventually become healthy. I see there's some code to backoff and retry the SA queries, but I suspect it will max out pretty quickly.

But there's also an issue of how long it takes even in normal operation to load the rate limit overrides. We've generally assumed these are going to be pretty fast, because there's not that many of them. But noticing that the RA is streaming results from the SA made me think again about how the list of overrides is only going to grow forever. Also, the growth rate is somewhat out of our control if we auto approve some requests. So we could wind up where an RA takes several seconds or potentially even minutes to start up, because it's loading overrides.

As a related issue: NewTransactionBuilder() does an SA RPC internally. In general I tend to assume constructor-like things run pretty fast and don't block on external services. How about this?

  • NewTransactionBuilder() returns quickly, with a TransactionBuilder that has no overrides loaded.
  • TransactionBuilder.NewRefresher attempts its first DB load immediately.
  • All DB loads have some amount of backoff and retry on error.
  • TransactionBuilder gets a new method .Ready() that returns true if it has successfully loaded overrides once.
  • The RA's health check return true if:
  • TransactionBuilder.Ready() returns true, OR
  • N seconds have passed since startup.

That last point allows us to avoid serving without overrides data on most reloads, but also ensure that a failure to load the overrides doesn't completely prevent the RA from serving. And since we'll still have the refresher running, the RA gets additional chances to load overrides if the first one fails.

@jprenken jprenken requested a review from jsha October 25, 2025 22:07
@jprenken jprenken requested a review from jsha October 27, 2025 23:58
@jprenken jprenken requested a review from jsha October 28, 2025 01:01
jsha
jsha previously approved these changes Oct 28, 2025
Copy link
Contributor

@aarongable aarongable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code and logic generally look great. A few high-level structural nits, but nothing that should be hard to address.

Copy link
Member

@beautifulentropy beautifulentropy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work on this, I'm very happy with how it turned out!

Comment on lines +114 to +115
// Health implements our grpc.checker interface. This method will be called
// periodically to set the gRPC service's healthpb.Health.Check() status.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just something that crossed my mind when reading this, no need to address this here: #8471

@jprenken jprenken merged commit d1e5e16 into main Oct 31, 2025
15 checks passed
@jprenken jprenken deleted the rlo-reload branch October 31, 2025 21:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

wfe/ra: Periodically load rate limit overrides from the database

5 participants