Description
I may be reporting 2 distinct problems:
1/ /stats?format={json|prometheus} slowness
2/ /ready endpoint blocked by /stats endpoint when running at the same time.
Maybe I should create 2 issues, but I am starting with this one to get feedback from the Envoy team.
We have some Envoy instances with a big config and a lot of mappings (between 5000 and 10000).
To reproduce the problem, I created a sample Envoy config with 7000 mappings, so it's easy to test.
Problem 1:
With this number of mappings, the /stats endpoint takes a bit more than 1 second.
/stats?format={json|prometheus} (either JSON or Prometheus format) takes more than 6 seconds!
That is more than 5x slower than the plain /stats endpoint.
Is that expected?
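For anyone who wants to compare the three formats without the attached script, here is a minimal sketch using curl's built-in timing. The admin address `http://localhost:9901` is an assumption (adjust `ADMIN` to whatever the admin listener in your config uses), and `compare_formats` is just a helper name made up for this example.

```shell
#!/usr/bin/env bash
# Assumption: the Envoy admin interface listens here; override ADMIN if not.
ADMIN="${ADMIN:-http://localhost:9901}"

# Print the total request time (seconds) for each /stats output format.
compare_formats() {
  local path
  for path in '/stats' '/stats?format=json' '/stats?format=prometheus'; do
    # -o /dev/null discards the body, -w prints curl's own total time
    printf '%s %ss\n' "$path" \
      "$(curl -s -o /dev/null -w '%{time_total}' "${ADMIN}${path}")"
  done
}
```

After sourcing the file, calling `compare_formats` prints one line per format, which makes the gap between plain /stats and the other two easy to see.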
Problem 2:
When /stats is called, and another client is doing a /ready request, the /ready request seems to be stuck until the /stats request is done.
So, when /stats?format=prometheus is called and takes 7 seconds (for example), and a client sends a /ready request at the beginning of this 7-second window, the /ready request is also going to take 7 seconds.
This causes a bunch of problems with monitoring, especially because we are using Ambassador. The Ambassador /check_ready endpoint is a wrapper around the Envoy /ready endpoint, and it has a 2-second timeout (https://github.com/datawire/ambassador/blob/d1a8b1ca89d878b4c8722f51f2479028288b747e/pkg/acp/envoy.go#L61 )
So, if Prometheus is scraping at the same time, the readiness check fails.
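The blocking behaviour can be checked with something like the sketch below: it starts a slow stats request in the background and times /ready while that request is still in flight. As before, the admin address `http://localhost:9901` is an assumption, and the function names `time_endpoint` and `ready_during_stats` are hypothetical helpers, not part of the attached test.sh.

```shell
#!/usr/bin/env bash
# Assumption: the Envoy admin interface listens here; override ADMIN if not.
ADMIN="${ADMIN:-http://localhost:9901}"

# Print the total time (seconds) curl needed for one admin endpoint.
time_endpoint() {
  curl -s -o /dev/null -w '%{time_total}' "${ADMIN}$1"
}

# Start a slow stats request in the background, then time /ready.
ready_during_stats() {
  time_endpoint '/stats?format=prometheus' > /dev/null &
  local stats_pid=$!
  sleep 0.5   # give the stats request time to start first
  echo "ready: $(time_endpoint /ready)s"
  wait "$stats_pid"
}
```

If /ready really waits for /stats, the printed time should be close to the stats duration rather than a few milliseconds, which is well past the 2-second timeout mentioned above.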
Repro steps:
(all attachments have an extra .txt extension that should be removed)
I am attaching the "sample-long.yaml" config with the 7000 mappings.
I am also attaching the "sample-long.sh" basic bash script that generates the sample config (if needed).
Here is the command I'm using to run envoy locally:
docker run --rm --network=host \
-v $(pwd)/sample-long.yaml:/sample.yaml \
-ti envoyproxy/envoy:v1.18.2 \
--config-path /sample.yaml
I'm attaching the test.sh script, which tests the performance of the different endpoints.
Here is the output of this test.sh script when running on my laptop. My laptop is otherwise idle, and Envoy handles no traffic except the tests.
> ./test.sh
*********************
*** TEST ready endpoint (it's fast)
0.01user 0.00system 0:00.01elapsed 92%CPU (0avgtext+0avgdata 11860maxresident)k
0inputs+0outputs (0major+650minor)pagefaults 0swaps
*********************
*** TEST stats+ready endpoint (ready endpoint is going to be slow because it's probably locked waiting for stat endpoint)
READY timing (slow :-( ):
0.00user 0.00system 0:08.14elapsed 0%CPU (0avgtext+0avgdata 12152maxresident)k
0inputs+0outputs (0major+660minor)pagefaults 0swaps
---------------
STATS timing:
0.00user 0.02system 0:08.76elapsed 0%CPU (0avgtext+0avgdata 12116maxresident)k
0inputs+0outputs (0major+683minor)pagefaults 0swaps
*********************
*** TEST stat endpoints (basic one: around 1 second)
0.00user 0.01system 0:01.12elapsed 1%CPU (0avgtext+0avgdata 12060maxresident)k
0inputs+0outputs (0major+677minor)pagefaults 0swaps
*********************
*** TEST stat endpoints (json one: more than 5x than basic one)
0.00user 0.01system 0:04.95elapsed 0%CPU (0avgtext+0avgdata 12276maxresident)k
0inputs+0outputs (0major+680minor)pagefaults 0swaps
*********************
*** TEST stat endpoints (prometheus one: more than 5x than basic one)
0.01user 0.01system 0:09.32elapsed 0%CPU (0avgtext+0avgdata 11840maxresident)k
0inputs+0outputs (0major+672minor)pagefaults 0swaps
You can see in the output that /ready takes more than 8 seconds when executed at the same time as /stats?format=prometheus.
Is that expected?
Thank you for any feedback.