add rps_threshold & max_rejection_probability#16742
add rps_threshold & max_rejection_probability#16742mattklein123 merged 8 commits intoenvoyproxy:mainfrom
Conversation
Signed-off-by: gaoweiwen <whereisgww@outlook.com>
|
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
|
I'm sorry that this PR came a little late since the last discussion in the related issue(#16392) , because some other things have filled up my time recently. Anyway, it's ready to be reviewed. |
Signed-off-by: gaoweiwen <whereisgww@outlook.com>
|
Assigning over to @tonya11en for first pass. |
tonya11en
left a comment
There was a problem hiding this comment.
Included some high-level comments on the approach to the traffic threshold.
| // If the number of requests in the last second is less than this threshold, the request | ||
| // will not be rejected, even if the success rate is lower than sr_threshold. | ||
| // Defaults to 0. | ||
| config.core.v3.RuntimeUInt32 rps_threshold = 6; |
There was a problem hiding this comment.
I don't think this should only consider the last 1 second. It would make more sense to look at something like the minimum number of samples in the sampling window. It would still accomplish the same thing, but give better behavior for bursty request patterns.
Would that work for your particular use-case?
There was a problem hiding this comment.
To be more clear, you can still represent what I'm suggesting as an RPS threshold. It would just be an "average RPS" threshold across the entirety of the sampling window.
| P_{reject} = \left\{ | ||
| \begin{array}{cl} | ||
| 0 & \ (rps < rps\_threshold) \\ | ||
| min({(\frac{n_{total} - s}{n_{total} + 1})}^\frac{1}{aggression}\ ,\ max\_reject\_probability) & \ (rps \geq rps\_threshold) | ||
| \end{array} | ||
| \right. |
There was a problem hiding this comment.
Can you include a screenshot of this rendered in the docs?
I also don't think it's necessary to change the function here as you're introducing circumstances under which this function applies, not changes to the behavior of the function.
There was a problem hiding this comment.
Your comment reminds me that this does make the formula more complicated. It may be more appropriate to explain these two configuration items after the original formula.
|
/wait |
Signed-off-by: gaoweiwen <whereisgww@outlook.com>
|
Changed LastSampleCounts to averageRps and updated the documentation. |
tonya11en
left a comment
There was a problem hiding this comment.
Thanks for doing this. Tests are very thorough!
The approach makes sense, so there's just a few minor comments.
| rejection probability will be higher for higher success rates. See `Aggression`_ for a more | ||
| detailed explanation. | ||
|
|
||
| In addition, there are two more configuration item: |
There was a problem hiding this comment.
| In addition, there are two more configuration item: | |
| Note that there are additional parameters that affect the rejection probability: |
| } | ||
|
|
||
| if (config_->getController().averageRps() < config_->rpsThreshold()) { | ||
| return Http::FilterHeadersStatus::Continue; |
There was a problem hiding this comment.
Consider throwing a debug log in here.
| : time_source_(time_source), sampling_window_(sampling_window) {} | ||
|
|
||
| uint32_t ThreadLocalControllerImpl::averageRps() const { | ||
| // In the very beginning of sampling, the average rps might be very big, so just skipping from it. |
There was a problem hiding this comment.
I don't quite understand what you mean here. Why do you want to do this?
There was a problem hiding this comment.
The first few requests in the first second may have a short interval, resulting in a big average rps. So skip the calculation of the first second.
There was a problem hiding this comment.
I see, because you're allowing fractional seconds due to the age being measured in microseconds.
You can just set the min sample age to 1sec. Cast to std::chrono::seconds instead of a double and do the integer division. There's no reason for the extra accuracy if the config is specified in uint32.
There was a problem hiding this comment.
Sounds reasonable.
| // Validate max rejection probability | ||
| EXPECT_CALL(runtime_.snapshot_, getDouble("foo.aggression", 1.0)).WillRepeatedly(Return(1.0)); | ||
| EXPECT_CALL(runtime_.snapshot_, getDouble("foo.threshold", 100.0)).WillRepeatedly(Return(100.0)); | ||
| EXPECT_CALL(runtime_.snapshot_, getDouble("foo.max_rejection_probability", 80.0)) | ||
| .WillRepeatedly(Return(10.0)); | ||
|
|
||
| verifyProbabilities(100, 0.0); | ||
| verifyProbabilities(95, 0.05); | ||
| verifyProbabilities(80, 0.1); | ||
| verifyProbabilities(0, 0.1); |
There was a problem hiding this comment.
Can you make this its own test case.
| // Validate historical_data_.size() < 2 | ||
| tlc_.recordSuccess(); | ||
| tlc_.recordFailure(); | ||
| EXPECT_EQ(0, tlc_.averageRps()); |
There was a problem hiding this comment.
I commented on this above, but just reiterating I'm a bit confused as to the purpose.
| EXPECT_EQ(RequestData(3, 3), tlc_.requestCounts()); | ||
| } | ||
|
|
||
| // Test for function: averageRps |
There was a problem hiding this comment.
Terminate the comments with periods pls.
|
/wait |
Signed-off-by: gaoweiwen <whereisgww@outlook.com>
|
/retest |
|
Retrying Azure Pipelines: |
tonya11en
left a comment
There was a problem hiding this comment.
Left 1 more comment. Looking good otherwise.
/wait
Signed-off-by: gaoweiwen <whereisgww@outlook.com>
tonya11en
left a comment
There was a problem hiding this comment.
Sorry, you'll also need to update the release notes. See:
https://github.com/envoyproxy/envoy/blob/main/docs/root/version_history/current.rst
Mention the new features and a change in the default behavior (max rejection probability). We should be good for a maintainer to take a look once that's done.
| using namespace std::chrono; | ||
| auto count_of_seconds = duration_cast<seconds>(ageOfOldestSample()).count(); | ||
|
|
||
| return count_of_seconds == 0 ? 0 : global_data_.requests / count_of_seconds; |
There was a problem hiding this comment.
I think you can just do something like below. There's just less going on that way IMO.
| using namespace std::chrono; | |
| auto count_of_seconds = duration_cast<seconds>(ageOfOldestSample()).count(); | |
| return count_of_seconds == 0 ? 0 : global_data_.requests / count_of_seconds; | |
| using std::chrono::seconds; | |
| seconds secs = std::max(seconds(1), ageOfOldestSample()); | |
| return global_data_.requests / secs.count(); |
|
/wait |
Signed-off-by: gaoweiwen <whereisgww@outlook.com>
Signed-off-by: gaoweiwen <whereisgww@outlook.com>
Signed-off-by: gaoweiwen <whereisgww@outlook.com>
tonya11en
left a comment
There was a problem hiding this comment.
This LGTM. I think it's ready for a maintainer pass @mattklein123
mattklein123
left a comment
There was a problem hiding this comment.
One small doc question/comment, thanks.
/wait
| * Requests will never be rejeted from this filter if the RPS is lower than 5. | ||
| * Rejection probability will never exceed 80% even if the failure rate is 100%. |
There was a problem hiding this comment.
If this is true and these values are clamped, can you make this more clear in the API docs? If it's not true can you make it more clear here?
There was a problem hiding this comment.
This isn't true all the time-- it's just explaining how to read the configuration above. Github cut out this line just above the additions:
The above configuration can be understood as follows:
There was a problem hiding this comment.
Ah OK sorry, I missed that, thanks.
|
/retest |
|
Retrying Azure Pipelines: |
|
I think this may be causing CI flakes could you take a look? |
…oyproxy#16742) Signed-off-by: gaoweiwen <whereisgww@outlook.com>
Signed-off-by: gaoweiwen whereisgww@outlook.com
Commit Message:
Add two configuration items to the admission_control filter.
rps_thresholdindicates the minimum RPS threshold.max_rejection_probabilityindicates the maximum throttling percentage.details in issue: #16392
Additional Description:
Risk Level: Medium
Testing: Unit test and integration test
Docs Changes: admission_control_filter.rst
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Deprecated:]
[Optional API Considerations:]