slow stats endpoint and blocking /ready endpoint when hitting /stats endpoint

I am probably reporting 2 distinct problems:

1/ `/stats?format={json|prometheus}` slowness
2/ `/ready` endpoint blocked by `/stats` endpoint when running at the same time.

Maybe I should create 2 issues, but I am starting with this one to get feedback from the Envoy team.

We have some Envoy instances with a big config containing a lot of mappings (between 5000 and 10000).
To reproduce the problem, I created a sample Envoy config with 7000 mappings, so it's easy to test.

Problem 1:
========
With this number of mappings, the `/stats` endpoint takes a bit more than 1 second.
The `/stats?format={json|prometheus}` endpoint (either json or prometheus format) is taking more than 6 seconds!
That is more than a 5x difference between plain `/stats` and the 2 other formats.
Is that expected?
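
For anyone who wants to compare the three formats without the attached script, here is a minimal Python sketch. It assumes the Envoy admin interface is reachable at an address you pass in (for example `http://localhost:9901`, which is only an assumption and depends on the `admin` section of the config). The function name `time_stats_formats` is hypothetical.

```python
import time
import urllib.request


def time_stats_formats(admin, formats=(None, "json", "prometheus"), timeout=60.0):
    """Fetch /stats once per format and return {name: (seconds, body_bytes)}.

    `admin` is the base URL of the admin interface (an assumption: pass
    whatever address your config exposes).
    """
    results = {}
    for fmt in formats:
        # Plain /stats has no query string; the others use ?format=<name>.
        url = admin + "/stats" + ("" if fmt is None else "?format=" + fmt)
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            size = len(resp.read())  # read the full body so the timing is complete
        results[fmt or "text"] = (time.monotonic() - start, size)
    return results
```

Calling `time_stats_formats("http://localhost:9901")` returns the elapsed seconds and response size for each format, which may help tell whether the slowdown simply tracks the output size.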

Problem 2:
==========
When `/stats` is called and another client sends a `/ready` request, the `/ready` request seems to be stuck until the `/stats` request is done.
So, when `/stats?format=prometheus` is called and takes 7 seconds (for example), and a client sends a `/ready` request at the beginning of that 7-second window, the `/ready` request is also going to take 7 seconds.
This causes a bunch of problems with monitoring, especially because we are using Ambassador: the Ambassador `/check_ready` endpoint is a wrapper around the Envoy `/ready` endpoint, and it has a 2-second timeout (https://github.com/datawire/ambassador/blob/d1a8b1ca89d878b4c8722f51f2479028288b747e/pkg/acp/envoy.go#L61 )

So, if Prometheus is scraping at the same time, the readiness check fails.
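
The blocking can be observed with a small Python sketch like the one below. It is not the attached `test.sh`, only an illustration under assumptions: the admin base URL is passed in by the caller (for example `http://localhost:9901`, which depends on the config), and the names `timed_get` and `probe_ready_during_stats` are hypothetical.

```python
import threading
import time
import urllib.request


def timed_get(url, timeout=30.0):
    """Return the wall-clock seconds needed to fetch url and read the body."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return time.monotonic() - start


def probe_ready_during_stats(admin, stats_path="/stats?format=prometheus", delay=0.5):
    """Start a stats request, then time /ready while stats is still in flight."""
    timings = {}

    def fetch_stats():
        timings["stats"] = timed_get(admin + stats_path)

    worker = threading.Thread(target=fetch_stats)
    worker.start()
    time.sleep(delay)  # give the stats request a head start
    timings["ready"] = timed_get(admin + "/ready")
    worker.join()
    return timings
```

If `/ready` is blocked behind `/stats`, the `ready` value returned by `probe_ready_during_stats("http://localhost:9901")` should be close to the `stats` value minus the delay, well above the 2-second timeout mentioned above.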

Repro steps:
==========
(all attachments have an extra .txt extension that should be removed)
I am attaching the "sample-long.yaml" config with the 7000 mappings.
I am also attaching the "sample-long.sh" basic bash script to generate the sample config (if needed).
Here is the command I'm using to run envoy locally:

```
docker run --rm --network=host \
    -v $(pwd)/sample-long.yaml:/sample.yaml \
    -ti envoyproxy/envoy:v1.18.2 \
    --config-path /sample.yaml
```

I'm attaching the `test.sh` script, which tests the performance of the different endpoints.

Here is the output of this `test.sh` script when running on my laptop. The laptop is otherwise idle, and Envoy handles no traffic except the tests.

```
> ./test.sh
*********************
*** TEST ready endpoint (it's fast)
0.01user 0.00system 0:00.01elapsed 92%CPU (0avgtext+0avgdata 11860maxresident)k
0inputs+0outputs (0major+650minor)pagefaults 0swaps

*********************
*** TEST stats+ready endpoint (ready endpoint is going to be slow because it's probably locked waiting for stat endpoint)
READY timing (slow :-( ):
0.00user 0.00system 0:08.14elapsed 0%CPU (0avgtext+0avgdata 12152maxresident)k
0inputs+0outputs (0major+660minor)pagefaults 0swaps
---------------
STATS timing:
0.00user 0.02system 0:08.76elapsed 0%CPU (0avgtext+0avgdata 12116maxresident)k
0inputs+0outputs (0major+683minor)pagefaults 0swaps

*********************
*** TEST stat endpoints (basic one: around 1 second)
0.00user 0.01system 0:01.12elapsed 1%CPU (0avgtext+0avgdata 12060maxresident)k
0inputs+0outputs (0major+677minor)pagefaults 0swaps

*********************
*** TEST stat endpoints (json one: more than 5x than basic one)
0.00user 0.01system 0:04.95elapsed 0%CPU (0avgtext+0avgdata 12276maxresident)k
0inputs+0outputs (0major+680minor)pagefaults 0swaps

*********************
*** TEST stat endpoints (prometheus one: more than 5x than basic one)
0.01user 0.01system 0:09.32elapsed 0%CPU (0avgtext+0avgdata 11840maxresident)k
0inputs+0outputs (0major+672minor)pagefaults 0swaps
```

You can see in the output that the `/ready` request takes more than 8 seconds when executed at the same time as `/stats?format=prometheus`.
Is that expected?

Thank you for any feedback.

[sample-long.sh.txt](https://github.com/envoyproxy/envoy/files/6455600/sample-long.sh.txt)

[test.sh.txt](https://github.com/envoyproxy/envoy/files/6455601/test.sh.txt)

[sample-long.yaml.txt](https://github.com/envoyproxy/envoy/files/6455602/sample-long.yaml.txt)

