Skip to content

[WIP] Admission Control Filter#10230

Closed
tonya11en wants to merge 64 commits intoenvoyproxy:masterfrom
tonya11en:admctl
Closed

[WIP] Admission Control Filter#10230
tonya11en wants to merge 64 commits intoenvoyproxy:masterfrom
tonya11en:admctl

Conversation

@tonya11en
Copy link
Copy Markdown
Member

@tonya11en tonya11en commented Mar 3, 2020

This patch is a starts progress toward #9658. It is still a work in progress, but I'm opening a draft PR because I'd like to to discuss the general approach before writing integration tests, documentation, etc.

A quick summary of the motivation and origin of the idea can be found in the description of #9658.

The filter tracks request success rate for each worker thread and uses the information to reject requests with some probability dictated by the SR (in some rolling window) before forwarding to the upstream.This removes the need for any locks such as those found in the local ratelimit filter and the adaptive concurrency filter.

The per-thread success rate calculation is performed by tracking the total request count and the total number of successes. These values are accumulated each second and inserted into a deque that contains the per-second accumulated values for the entire rolling window. This allows us to efficiently phase out stale SR data as it is no longer in the time window.

There are a few items I'd appreciate some preliminary feedback on. Once there's a consensus, I'll write up documentation and integration tests. The two major items I'd like any feedback/comments on are:

  • What would be the best way to configure the definition of a "successful" request? I thought something similar to the retry_on categories would do the trick, but this might not work for grpc.
  • Could there be workloads that could see potential problems with this per-thread SR accounting approach?

Tony Allen and others added 30 commits January 10, 2020 15:13
wip
Signed-off-by: Tony Allen <tallen@lyft.com>
wip
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tony@allen.gg>
@tonya11en
Copy link
Copy Markdown
Member Author

/wait-any

Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
@tonya11en
Copy link
Copy Markdown
Member Author

Basic HTTP integration test is implemented. Since we're validating probabilities, I ensured it wasn't a flaky test:

± % btd //test/extensions/filters/http/admission_control:admission_control_integration_test --runs_per_test=1000
WARNING: The following rc files are no longer being read, please transfer their contents or import their path into one of the standard rc files:
/home/tallen/src/envoy/tools/bazel.rc
INFO: Invocation ID: c4325847-2684-473e-8c0a-7ac2ce6ae1c8
INFO: Analyzed target //test/extensions/filters/http/admission_control:admission_control_integration_test (0 packages loaded, 0 targets configured).
INFO: Found 1 test target...
Target //test/extensions/filters/http/admission_control:admission_control_integration_test up-to-date:
  bazel-bin/test/extensions/filters/http/admission_control/admission_control_integration_test
INFO: Elapsed time: 1012.032s, Critical Path: 49.45s
INFO: 1003 processes: 1 remote cache hit, 1002 linux-sandbox.
INFO: Build completed successfully, 1004 total actions
//test/extensions/filters/http/admission_control:admission_control_integration_test PASSED in 24.8s
  Stats over 1000 runs: max = 24.8s, min = 18.4s, avg = 23.5s, dev = 0.7s

Executed 1 out of 1 test: 1 test passes.
There were tests whose specified size is too big. Use the --test_verbose_timeout_warnings command line oINFO: Build completed successfully, 1004 total actions

Signed-off-by: Tony Allen <tony@allen.gg>
@stale
Copy link
Copy Markdown

stale bot commented Mar 20, 2020

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Mar 20, 2020
@tonya11en
Copy link
Copy Markdown
Member Author

Actively wrestling with GRPC integration test.

@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Mar 20, 2020
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
@tonya11en
Copy link
Copy Markdown
Member Author

Sorry for the delay- it turns out that when sending the local reply 503 in the filter, the GRPC status is not set. This caused the integration client to observe the upstream probabilistically return HTTP 503s with no GRPC status 50% of the time, or HTTP 200 with the “Unknown” GRPC status in the trailer.

I'm foregoing documentation until the general behavior of this filter has ossified. Once that happens, I'll take this PR out of draft along with pushing a bunch of docs.

Signed-off-by: Tony Allen <tony@allen.gg>
@tonya11en
Copy link
Copy Markdown
Member Author

@mattklein123 after you take your first pass through the code, I'll pull this PR out of draft and remove the WIP assuming no major architectural changes are required.


private:
std::vector<std::function<bool(uint64_t)>> http_success_fns_;
std::unordered_set<uint64_t> grpc_success_codes_;
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just occurred to me that there are a handful of gRPC success codes. It'll be more performant to just do a sequential search over a vector instead of hashing in an unordered set.

I'll make that change next round.

Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks I left a few comments. In general this looks great but I take back what I said about a single PR. Can we split this into at least 1) the ancillary changes, 2) the controller, and 3) the filter?

Thank you!

/wait

string runtime_key = 3 [(validate.rules).string = {min_bytes: 1}];
}

// Runtime derived double with a default when not specified.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this PR is so large please split this and associated tests into a different PR (if there is anything else that can be split out please do so)

// values.
message DefaultSuccessCriteria {
// If HTTP statuses are unspecified, defaults to 2xx.
repeated HttpStatusRange http_status = 1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would make this an actual numeric range for flexibility. See Int32Range or similar as something you might potentially use.

api.v2.core.RuntimeFeatureFlag enabled = 1;

// The time window over which the success rate is calculated. The window is rounded to the nearest
// second. Defaults to 120s.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bump still needs more verbiage to talk about sliding, etc.

// GRPC status.
message GrpcStatus {
// Supplies GRPC response code.
Status status = 1 [(validate.rules).enum = {defined_only: true not_in: 0}];
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the not_in 0 do?

// [#protodoc-title: GRPC status codes]

// GRPC response codes supported.
enum Status {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lizan does this exist anywhere else? Any thoughts on defining this in our API?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this doesn't exist. I don't think we should define this as enum but just as int32 (that's what google.rpc.Status does) because in data plane there will be custom defined error codes as well.

* The look-back window for request samples is accurate up to a hard-coded 1-second granularity.
* TODO (tonya11en): Allow the granularity to be configurable.
*/
class ThreadLocalController {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah given the size of this PR if possible maybe split the controller out into its own PR? If it's a big deal I can do in one review but it would be nice to not have so much code to look at.

@htuch
Copy link
Copy Markdown
Member

htuch commented Apr 7, 2020

Please merge master to pick up #10672. We no longer accept changes to v2 (without explicit exception), so any API modifications should happen in v3. If this PR is adding a new proto, please follow the updated instructions in https://github.com/envoyproxy/envoy/blob/master/api/STYLE.md#adding-an-extension-configuration-to-the-api.

@stale
Copy link
Copy Markdown

stale bot commented Apr 14, 2020

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Apr 14, 2020
@stale
Copy link
Copy Markdown

stale bot commented Apr 25, 2020

This pull request has been automatically closed because it has not had activity in the last 14 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot closed this Apr 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api stale stalebot believes this issue/PR has not been touched recently waiting

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants