overload: tcp connection refusal overload action#13311
overload: tcp connection refusal overload action#13311mattklein123 merged 12 commits intoenvoyproxy:masterfrom
Conversation
This will allow creating an overload action that rejects (accepts the connection and then closes it) some fraction of connections in response to overload conditions. Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
|
I still need to add documentation, but I wanted to get the code parts in shape first. |
alyssawilk
left a comment
There was a problem hiding this comment.
Looks great overall - thanks for picking this up!
| } | ||
|
|
||
| void ConnectionHandlerImpl::setListenerRejectFraction(float reject_fraction) { | ||
| disable_listeners_ = false; |
There was a problem hiding this comment.
seems strange to change this boolean without resuming listening.
Also if new listeners are added won't they be added listening but without a rejection fraction?
There was a problem hiding this comment.
Thanks, this was a bad copy-paste on my part. Fixed in the latest commit, and added tests.
| return true; | ||
| } | ||
|
|
||
| return random_generator.random() < |
There was a problem hiding this comment.
maybe comment why this is overflow-safe?
| * of the level of saturation. | ||
| */ | ||
| bool isRandomizedActive(Random::RandomGenerator& random_generator) const { | ||
| if (isInactive()) { |
There was a problem hiding this comment.
Is this just for performance, or do the edges not work well with the casts?
There was a problem hiding this comment.
This was stolen from #13120, but I assume it's for performance - don't grab a random number if it's not necessary. That's why this PR is still a draft.
Signed-off-by: Alex Konradi <akonradi@google.com>
Address feedback around adding new listeners when the reject fraction is already set, and add tests. Signed-off-by: Alex Konradi <akonradi@google.com>
alyssawilk
left a comment
There was a problem hiding this comment.
I think the structure looks good overall, so want to ping when you have docs and relnotes and such? It wouldn't hurt to have an integration test too. Even if we can only usefully test the case where we're rejecting 100%, we could test accepting, config reload, and rejecting.
| } | ||
|
|
||
| if (rejectCxOverGlobalLimit()) { | ||
| if (rejectCxOverGlobalLimit() || reject_fraction_.isRandomizedActive(random_)) { |
There was a problem hiding this comment.
Once #13305 lands (which I think will before this) I thin it'd be useful to tag why the connection was closed.
There was a problem hiding this comment.
I don't think we can tag anything here since this happens before an Envoy connection object is created, so there's nothing to annotate.
| if (rejectCxOverGlobalLimit() || reject_fraction_.isRandomizedActive(random_)) { | ||
| // The global connection limit has been reached. | ||
| io_handle->close(); | ||
| cb_.onReject(); |
There was a problem hiding this comment.
Looks like this increments stats_.downstream_global_cx_overflow_.inc()
Think we should have a separate limit for overflow?
If not maybe at least update the docs to note the stat covers both overflow cases.
There was a problem hiding this comment.
I've added a separate stat.
…-refusal Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
|
@alyssawilk I've updated the release notes and documentation, PTAL |
Signed-off-by: Alex Konradi <akonradi@google.com>
|
Looks great thanks. Throwing over to Matt for !googler review but in the meantime if you're up for updating the TODOs in PR description that'd be helpful |
|
Thanks, updated TODOs (that I had completely forgotten about) |
|
Windows CI failure looks real. Can you take a look? /wait |
Relax the ordering constraints for events that occur asynchronously on different ends of the connection while maintaining requirements for relative ordering on each end. Signed-off-by: Alex Konradi <akonradi@google.com>
|
Relaxed some of the ordering requirements in the test for things that shouldn't necessarily be synchronized. Looks like the tests are passing now |
mattklein123
left a comment
There was a problem hiding this comment.
Very nice! LGTM with small comments. Thank you.
/wait
| void TcpListenerImpl::disable() { file_event_->setEnabled(0); } | ||
|
|
||
| void TcpListenerImpl::setRejectFraction(const float reject_fraction) { | ||
| reject_fraction_ = reject_fraction; |
There was a problem hiding this comment.
Can you ASSERT 0 <= reject_fraction <= 1? I think the Bernoulli function correctly handles out of bounds but might as well try to check for obvious config/code bugs. (Bonus points would be to introduce a struct wrapper for the float which would check this at point of assignment, etc. but up to you.)
| stats_.downstream_global_cx_overflow_.inc(); | ||
| break; | ||
| case RejectCause::OverloadAction: | ||
| stats_.downstream_cx_overload_reject_.inc(); |
There was a problem hiding this comment.
Please verify this stat increment in one of your tests.
There was a problem hiding this comment.
Added a unit test.
Signed-off-by: Alex Konradi <akonradi@google.com>
…-refusal Signed-off-by: Alex Konradi <akonradi@google.com>
|
Needs main merge. /wait |
…-refusal Signed-off-by: Alex Konradi <akonradi@google.com>
|
Yep, I managed to add new tests in the same place both in this PR and in 13515 🤦 |
* master: (22 commits) http: using CONNECT_ERROR for HTTP/2 (envoyproxy#13519) listener: respect address.pipe.mode (it didn't work) (envoyproxy#13493) examples: Fix more deprecations/warnings in configs (envoyproxy#13529) overload: tcp connection refusal overload action (envoyproxy#13311) tcp: towards pluggable upstreams (envoyproxy#13331) conn_pool: fixing comments (envoyproxy#13520) Prevent SEGFAULT when disabling listener (envoyproxy#13515) Convert overload manager config literals to YAML (envoyproxy#13518) Fix runtime feature variable name (envoyproxy#13533) dependencies: refactor repository location schema utils, cleanups. (envoyproxy#13452) router: fix an invalid ASSERT when encoding metadata frames in the router. (envoyproxy#13511) http2: Proactively disconnect connections flooded when resetting stream (envoyproxy#13482) ci use azp to sync filter example (envoyproxy#13501) mongo_proxy: support configurable command list for metrics (envoyproxy#13494) http local rate limit: note token bucket is shared (envoyproxy#13525) wasm/extensions: Wasm extension policy. (envoyproxy#13526) http: removing envoy.reloadable_features.http1_flood_protection (envoyproxy#13508) build: update ppc64le CI build status shield (envoyproxy#13521) dependencies: enforce dependency shepherd sign-off via RepoKitteh. (envoyproxy#13522) Add no_traffic_healthy_interval (envoyproxy#13336) ... Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Commit Message: Add an overload action to reject (accept and then close) TCP connections
Additional Description:
Add an action
"envoy.overload_actions.reject_incoming_connections"that will rejectincoming connections when Envoy is overloaded. It will randomly choose whether to
reject connections based on the overload factor, with expected behavior at the ends
of the range (reject no connections at 0, reject all at saturation).
Risk Level: low
Testing: ran unit tests
Docs Changes: documented new overload action
Release Notes: documented new overload action