Skip to content

adaptive concurrency: Filter configuration and integration tests#8244

Merged
mattklein123 merged 40 commits intoenvoyproxy:masterfrom
tonya11en:acc_2
Oct 1, 2019
Merged

adaptive concurrency: Filter configuration and integration tests#8244
mattklein123 merged 40 commits intoenvoyproxy:masterfrom
tonya11en:acc_2

Conversation

@tonya11en
Copy link
Copy Markdown
Member

@tonya11en tonya11en commented Sep 14, 2019

This patch brings the concurrency controller and the adaptive concurrency control (ACC) filter together to make the filter fully useable with integration tests.

The integration tests make use of the fault filter, placed after the ACC filter, to emulate request latencies and to also move time forward. Even though this is against the recommendation in the documentation to have the filter come before all other filters, it was necessary to have last. Not every request makes it to the upstream due to occasionally being blocked by the ACC filter, so it was not possible to simply call timeSystem().sleep and rely on the autonomous upstream to reply to the requests.

Patch 3/n towards #7789.

Temporal behavior

Now that all the pieces are together, it's worth comparing the filter's behavior with what we expecte from the proof-of-concept. Time sampled stats show similar behavior to the proof-of-concept filter's data in #7789. Here's a plot of latencies without the adaptive concurrency filter:
oss_noacc

Now with the adaptive concurrency filter:
oss_acc

Tony Allen added 7 commits September 12, 2019 18:20
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tallen@lyft.com>
@mattklein123 mattklein123 self-assigned this Sep 14, 2019
Signed-off-by: Tony Allen <tony@allen.gg>
@tonya11en
Copy link
Copy Markdown
Member Author

/wait

Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
Signed-off-by: Tony Allen <tony@allen.gg>
@tonya11en tonya11en marked this pull request as ready for review September 16, 2019 17:32
@tonya11en
Copy link
Copy Markdown
Member Author

Marking as ready since the CI release test just flaked out. The thread_local_store_test works fine for me locally when tested on release:

INFO: Elapsed time: 282.725s, Critical Path: 152.57s
INFO: 1329 processes: 1329 linux-sandbox.
INFO: Build completed successfully, 1687 total actions
//test/common/stats:thread_local_store_test                              PASSED in 27.3s

Executed 1 out of 1 test: 1 test passes.
INFO: Build completed successfully, 1687 total actions

tallen@legion ~/src/envoy [acc_2]
± % git log | head -1                                                                                                                  !10376
commit 6f5684661ddcdba6cc38f6f42fbcaef8da3ea149 (HEAD -> acc_2, tony/acc_2)

@@ -27,8 +27,8 @@ AdaptiveConcurrencyFilter::AdaptiveConcurrencyFilter(
Http::FilterHeadersStatus AdaptiveConcurrencyFilter::decodeHeaders(Http::HeaderMap&, bool) {
if (controller_->forwardingDecision() == ConcurrencyController::RequestForwardingAction::Block) {
// TODO (tonya11en): Remove filler words.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the TODO

Signed-off-by: Tony Allen <tallen@lyft.com>
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, awesome to see this coming together. A few nits, but my main comment is I would rather this be done with simulated time or at a minimum have a TODO explaining why it can't be and also do a de-flaking run either way. Thank you!

/wait


void initializeFilter() {
// We use the fault filter (for delays) after the adaptive concurrency filter to introduce a
// "service latency" to the test. This way, time is moved forward with each response.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't you just use simulated time and just advance time before upstream responds?

Wherever we end up on this, please verify that you have done a de-flaking run on these new tests (--runs_per_test=1000) to verify a lack of flakes.

Signed-off-by: Tony Allen <tony@allen.gg>
@tonya11en
Copy link
Copy Markdown
Member Author

/wait

mattklein123
mattklein123 previously approved these changes Sep 26, 2019
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good. You can fix the nits in your next change.

@mattklein123
Copy link
Copy Markdown
Member

Please merge master. (Might as well fix the nits that I pointed out at the same time since you have to push again.)

/wait

Tony Allen added 3 commits September 30, 2019 10:57
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tallen@lyft.com>
mattklein123
mattklein123 previously approved these changes Sep 30, 2019
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. @tonya11en please confirm that you have run the integration tests under --runs_per_test to check for flakes?

@tonya11en
Copy link
Copy Markdown
Member Author

Not recently- let me go ahead and kick off 1000 and report back before merge.

Signed-off-by: Tony Allen <tallen@lyft.com>
@tonya11en
Copy link
Copy Markdown
Member Author

@mattklein123 looks stable:

>> btd //test/extensions/filters/http/adaptive_concurrency:adaptive_concurrency_integration_test --runs_per_test=1000 --test_timeout=20 --jobs=1               
WARNING: The following rc files are no longer being read, please transfer their contents or import their path into one of the standard rc files:
/home/tallen/src/envoy/tools/bazel.rc
INFO: Invocation ID: 457c1652-a84a-4ea4-8ef9-fd8f885fee03
INFO: Analyzed target //test/extensions/filters/http/adaptive_concurrency:adaptive_concurrency_integration_test (0 packages loaded, 0 targets configured).
INFO: Found 1 test target...
Target //test/extensions/filters/http/adaptive_concurrency:adaptive_concurrency_integration_test up-to-date:
  bazel-bin/test/extensions/filters/http/adaptive_concurrency/adaptive_concurrency_integration_test
INFO: Elapsed time: 1428.001s, Critical Path: 1.56s
INFO: 1000 processes: 1000 linux-sandbox.
INFO: Build completed successfully, 1001 total actions
//test/extensions/filters/http/adaptive_concurrency:adaptive_concurrency_integration_test PASSED in 1.5s
  Stats over 1000 runs: max = 1.5s, min = 1.3s, avg = 1.4s, dev = 0.0s

Executed 1 out of 1 test: 1 test passes.
There were tests whose specified size is too big. Use the --test_verbose_timeout_warnings command line option to see whINFO: Build completed successfully, 1001 total actions

I needed to make sure there weren't multiple copies of this integration test running at the same time. When running it with 12 concurrency workers, there was a 2% failure rate, but from my investigations, I think this was due to the integration test framework and the fake upstreams.

@mattklein123
Copy link
Copy Markdown
Member

I needed to make sure there weren't multiple copies of this integration test running at the same time. When running it with 12 concurrency workers, there was a 2% failure rate, but from my investigations, I think this was due to the integration test framework and the fake upstreams.

This means it will definitely flake in CI. It should work 100% of the time with concurrent workers, so we will need to investigate this unfortunately.

/wait

Signed-off-by: Tony Allen <tallen@lyft.com>
@tonya11en
Copy link
Copy Markdown
Member Author

Alright, now we're cooking:

>> btd //test/extensions/filters/http/adaptive_concurrency:adaptive_concurrency_integration_test --runs_per_test=1000 --test_timeout=20
WARNING: The following rc files are no longer being read, please transfer their contents or import their path into one of the standard rc files:
/home/tallen/src/envoy/tools/bazel.rc
INFO: Invocation ID: e2f638f0-7bf8-4082-bf2e-0fa2655362c9
INFO: Build options --cache_test_results, --run_under, --runs_per_test, and 1 more have changed, discarding analysis cache.
INFO: Analyzed target //test/extensions/filters/http/adaptive_concurrency:adaptive_concurrency_integration_test (0 packages loaded, 14062 targets configured).
INFO: Found 1 test target...
Target //test/extensions/filters/http/adaptive_concurrency:adaptive_concurrency_integration_test up-to-date:
  bazel-bin/test/extensions/filters/http/adaptive_concurrency/adaptive_concurrency_integration_test
INFO: Elapsed time: 127.585s, Critical Path: 1.79s
INFO: 1000 processes: 1000 linux-sandbox.
INFO: Build completed successfully, 1001 total actions
//test/extensions/filters/http/adaptive_concurrency:adaptive_concurrency_integration_test PASSED in 1.8s
  Stats over 1000 runs: max = 1.8s, min = 1.3s, avg = 1.5s, dev = 0.1s

Executed 1 out of 1 test: 1 test passes.
There were tests whose specified size is too big. Use the --test_verbose_timeout_warnings command line option to see whINFO: Build completed successfully, 1001 total actions

12 concurrent jobs this time.

There were actually 2 races that needed to be fixed:

  1. The adaptive concurrency filter was responding to requests that it was rejecting before the test was able to send the request data. This caused the integration test's codec client to use a bogus stream and segfault. This was fixed by just sending header-only requests.
  2. The header-only requests were then sent so quickly that the counters were non-deterministic. This required us to wait for the blocked counter to be incremented after each request before proceeding. Now the stats are as we'd expect.

Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for fixing the flakes Tony!

@mattklein123 mattklein123 merged commit a119a07 into envoyproxy:master Oct 1, 2019
danzh2010 pushed a commit to danzh2010/envoy that referenced this pull request Oct 4, 2019
@tonya11en tonya11en deleted the acc_2 branch November 13, 2019 04:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants