[WIP] Admission Control Filter #10230
Conversation
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
/wait-any
Basic HTTP integration test is implemented. Since we're validating probabilities, I ensured it wasn't a flaky test:
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
Actively wrestling with the gRPC integration test.
Sorry for the delay. It turns out that when sending the local reply 503 in the filter, the gRPC status is not set. This caused the integration client to observe the upstream probabilistically return either HTTP 503 with no gRPC status (50% of the time) or HTTP 200 with the “Unknown” gRPC status in the trailers. I'm forgoing documentation until the general behavior of this filter has ossified. Once that happens, I'll take this PR out of draft and push a bunch of docs.
@mattklein123 after you take your first pass through the code, I'll pull this PR out of draft and remove the WIP, assuming no major architectural changes are required.
private:
  std::vector<std::function<bool(uint64_t)>> http_success_fns_;
  std::unordered_set<uint64_t> grpc_success_codes_;
It just occurred to me that there are a handful of gRPC success codes. It'll be more performant to just do a sequential search over a vector instead of hashing in an unordered set.
I'll make that change next round.
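As a hedged sketch of that suggestion (class and method names here are hypothetical, not taken from the PR), the handful of gRPC success codes can live in a small vector and be checked with a linear scan, which avoids the hashing overhead and has better cache locality than an `unordered_set` at this size:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical container for the configured gRPC success codes. With only a
// few entries, std::find over a contiguous vector is typically faster than
// hashing into an unordered_set.
class GrpcSuccessCodes {
public:
  explicit GrpcSuccessCodes(std::vector<uint64_t> codes) : codes_(std::move(codes)) {}

  // Linear scan; O(n) with tiny n beats the constant-factor cost of hashing.
  bool contains(uint64_t code) const {
    return std::find(codes_.begin(), codes_.end(), code) != codes_.end();
  }

private:
  std::vector<uint64_t> codes_;
};
```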
mattklein123 left a comment:
Thanks, I left a few comments. In general this looks great, but I take back what I said about a single PR. Can we split this into at least: 1) the ancillary changes, 2) the controller, and 3) the filter?
Thank you!
/wait
string runtime_key = 3 [(validate.rules).string = {min_bytes: 1}];
}

// Runtime derived double with a default when not specified.
Since this PR is so large, please split this and the associated tests into a different PR (if there is anything else that can be split out, please do so).
// values.
message DefaultSuccessCriteria {
  // If HTTP statuses are unspecified, defaults to 2xx.
  repeated HttpStatusRange http_status = 1;
I think I would make this an actual numeric range for flexibility. See Int32Range or similar as something you might potentially use.
api.v2.core.RuntimeFeatureFlag enabled = 1;

// The time window over which the success rate is calculated. The window is rounded to the nearest
// second. Defaults to 120s.
Bump: this still needs more verbiage to talk about sliding windows, etc.
// GRPC status.
message GrpcStatus {
  // Supplies GRPC response code.
  Status status = 1 [(validate.rules).enum = {defined_only: true not_in: 0}];

// [#protodoc-title: GRPC status codes]

// GRPC response codes supported.
enum Status {
@lizan does this exist anywhere else? Any thoughts on defining this in our API?
No, this doesn't exist. I don't think we should define this as an enum, just as an int32 (that's what google.rpc.Status does), because in the data plane there will be custom-defined error codes as well.
 * The look-back window for request samples is accurate up to a hard-coded 1-second granularity.
 * TODO (tonya11en): Allow the granularity to be configurable.
 */
class ThreadLocalController {
Yeah, given the size of this PR, maybe split the controller out into its own PR if possible? If it's a big deal I can do it in one review, but it would be nice to not have so much code to look at.
Please merge master to pick up #10672. We no longer accept changes to v2 (without explicit exception), so any API modifications should happen in v3. If this PR is adding a new proto, please follow the updated instructions in https://github.com/envoyproxy/envoy/blob/master/api/STYLE.md#adding-an-extension-configuration-to-the-api.
This pull request has been automatically closed because it has not had activity in the last 14 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
This patch starts progress toward #9658. It is still a work in progress, but I'm opening a draft PR because I'd like to discuss the general approach before writing integration tests, documentation, etc.
A quick summary of the motivation and origin of the idea can be found in the description of #9658.
The filter tracks the request success rate on each worker thread and uses that information to reject requests, with a probability dictated by the success rate (SR) over some rolling window, before forwarding to the upstream. This removes the need for any locks, such as those found in the local ratelimit filter and the adaptive concurrency filter.
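The PR text does not spell out the exact rejection formula, so purely as an illustration (function name and parameters are hypothetical), here is the client-side throttling formula from the Google SRE book, which admission-control filters of this style are commonly modeled on:

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative only: reject with probability
//   max(0, (requests - k * accepts) / (requests + 1))
// where k > 1 controls how aggressively the filter sheds load as the
// observed accept (success) count falls. The actual filter's formula
// may differ.
double rejectionProbability(uint64_t requests, uint64_t accepts, double k) {
  if (requests == 0) {
    return 0.0; // No samples yet; admit everything.
  }
  const double num =
      static_cast<double>(requests) - k * static_cast<double>(accepts);
  return std::max(0.0, num / (static_cast<double>(requests) + 1.0));
}
```

With `k = 2.0`, the filter rejects nothing while at least half of requests succeed, and ramps up rejection as the success rate drops below that.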
The per-thread success rate calculation is performed by tracking the total request count and the total number of successes. These values are accumulated each second and inserted into a deque that contains the per-second accumulated values for the entire rolling window. This allows us to efficiently phase out stale SR data as it falls outside the time window.
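A minimal sketch of that deque-based accumulation (all names here are hypothetical; the real `ThreadLocalController` differs) might look like:

```cpp
#include <chrono>
#include <cstdint>
#include <deque>

// Hypothetical per-thread sampler: requests are accumulated into one-second
// buckets; buckets older than the look-back window are popped from the front
// of the deque, so stale data ages out in O(1) per expired bucket.
class RollingWindowSampler {
public:
  explicit RollingWindowSampler(std::chrono::seconds window) : window_(window) {}

  // Record one request outcome at the given (monotonic) second.
  void record(uint64_t now_sec, bool success) {
    if (buckets_.empty() || buckets_.back().second_ != now_sec) {
      buckets_.push_back(Bucket{now_sec, 0, 0});
    }
    buckets_.back().total_++;
    if (success) {
      buckets_.back().success_++;
    }
    // Phase out buckets that fell outside the look-back window.
    while (!buckets_.empty() &&
           now_sec - buckets_.front().second_ >=
               static_cast<uint64_t>(window_.count())) {
      buckets_.pop_front();
    }
  }

  // Success rate over the current window; 1.0 when there are no samples.
  double successRate() const {
    uint64_t total = 0;
    uint64_t success = 0;
    for (const auto& b : buckets_) {
      total += b.total_;
      success += b.success_;
    }
    return total == 0 ? 1.0 : static_cast<double>(success) / total;
  }

private:
  struct Bucket {
    uint64_t second_;
    uint64_t total_;
    uint64_t success_;
  };
  std::chrono::seconds window_;
  std::deque<Bucket> buckets_;
};
```

Because samples only ever enter at the back and expire from the front, no locking is needed when each worker thread owns its own sampler.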
There are a few items I'd appreciate some preliminary feedback on. Once there's a consensus, I'll write up documentation and integration tests. The two major items I'd like feedback/comments on are: