
high CPU in envoy 1.16.0 #14784

@guyco-redis

Description

After creating some listeners, we saw Envoy's CPU usage spike to 600% (on an 8-core machine), even though there was no traffic.
The listeners were configured to serve ports that were held by another process (which eventually released them). Although we created 6 listeners on 6 different ports, we only saw Envoy listening on 3 of them.
We didn't have CPU profiling enabled and the issue is not reproducible, but we do have strace logs showing Envoy's workers executing the following in a tight loop, which we believe caused the CPU spike:

epoll_wait(41, [{EPOLLHUP, {u32=57, u64=57}}, {EPOLLHUP, {u32=58, u64=58}}, {EPOLLHUP, {u32=59, u64=59}}], 32, 44) = 3
accept4(57, 0x7f458aad5070, 0x7f458aad5008, SOCK_NONBLOCK) = -1 EINVAL (Invalid argument)
accept(57, 0x7f458aad5070, 0x7f458aad5008) = -1 EINVAL (Invalid argument)
accept4(58, 0x7f458aad5070, 0x7f458aad5008, SOCK_NONBLOCK) = -1 EINVAL (Invalid argument)
accept(58, 0x7f458aad5070, 0x7f458aad5008) = -1 EINVAL (Invalid argument)
accept4(59, 0x7f458aad5070, 0x7f458aad5008, SOCK_NONBLOCK) = -1 EINVAL (Invalid argument)
accept(59, 0x7f458aad5070, 0x7f458aad5008) = -1 EINVAL (Invalid argument)
epoll_wait(41, [{EPOLLHUP, {u32=57, u64=57}}, {EPOLLHUP, {u32=58, u64=58}}, {EPOLLHUP, {u32=59, u64=59}}], 32, 44) = 3
accept4(57, 0x7f458aad5070, 0x7f458aad5008, SOCK_NONBLOCK) = -1 EINVAL (Invalid argument)
accept(57, 0x7f458aad5070, 0x7f458aad5008) = -1 EINVAL (Invalid argument)
accept4(58, 0x7f458aad5070, 0x7f458aad5008, SOCK_NONBLOCK) = -1 EINVAL (Invalid argument)
accept(58, 0x7f458aad5070, 0x7f458aad5008) = -1 EINVAL (Invalid argument)
accept4(59, 0x7f458aad5070, 0x7f458aad5008, SOCK_NONBLOCK) = -1 EINVAL (Invalid argument)
accept(59, 0x7f458aad5070, 0x7f458aad5008) = -1 EINVAL (Invalid argument)

We think the 3 missing ports are the ones referenced by this strace. According to Envoy's logs, these listeners were created successfully.

In our scenario, the other process stops serving these ports and Envoy takes over. These events are asynchronous, so there is a window of several seconds in which Envoy tries to bind the ports and is refused with "address already in use". However, we only saw that error for the "good" ports; for the ports Envoy never served, we only saw a successful log line for the creation of the listeners.

The question is: could Envoy create the listeners and believe it has bound the ports even though the bind was unsuccessful, leading to the loop above?

Running Envoy 1.16.0 on Ubuntu 16.04.

Metadata

Labels: area/listener, bug, stale (stalebot believes this issue/PR has not been touched recently)
