bug: envoy proxy pods restarted always when configuration change #2965

@zetaab

Description:

 % kubectl get pods -n envoy-gateway-system
NAME                                          READY   STATUS        RESTARTS   AGE
envoy-eg-external-a514c411-5ffc5d664b-74g5j   2/2     Running       0          2d19h
envoy-eg-external-a514c411-5ffc5d664b-dhmmg   2/2     Running       0          2d19h
envoy-eg-internal-7f4ff7e4-7fb9c8d7df-8kjgk   2/2     Terminating   0          27s
envoy-eg-internal-7f4ff7e4-7fb9c8d7df-krcqc   2/2     Terminating   0          25s
envoy-eg-internal-7f4ff7e4-8685896d59-4z8n8   1/2     Terminating   0          4m31s
envoy-eg-internal-7f4ff7e4-8685896d59-gqtd7   2/2     Running       0          6s
envoy-eg-internal-7f4ff7e4-8685896d59-lrd4f   2/2     Running       0          4s
envoy-gateway-5987f4589-9h6ts                 1/1     Running       0          3d

Whenever I modify a resource such as an HTTPRoute, the envoy proxy pods restart. This is a problem when an external loadbalancer sits in front of envoy. I do understand that envoy will drain connections. However, because envoy uses externalTrafficPolicy: Local by default, the external loadbalancer marks as healthy only the nodes that run envoy pods. When these pods move between nodes, it always takes 10-30 seconds (depending on how the loadbalancer health checks are configured) before the services start replying again.
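
To make the timing concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a simplified model in which the loadbalancer probes every node at a fixed interval and requires a number of consecutive successful probes before it sends traffic to a node. The function name, its parameters and the sample values are hypothetical and do not describe any particular loadbalancer.

```python
def worst_case_recovery_seconds(interval_s: float, healthy_threshold: int) -> float:
    """Upper bound on the time until a node that just received an envoy pod
    is marked healthy, under the simplified model described above.

    Assumption: the pod becomes ready right after a probe was sent, so a full
    interval passes before the first successful probe, and `healthy_threshold`
    consecutive successes are required.
    """
    if interval_s <= 0 or healthy_threshold < 1:
        raise ValueError("interval_s must be > 0 and healthy_threshold >= 1")
    return interval_s * healthy_threshold


# Hypothetical (interval, threshold) settings, chosen only for illustration.
for interval, threshold in [(5, 2), (10, 3), (60, 1)]:
    print(interval, threshold, worst_case_recovery_seconds(interval, threshold))
```

Under this model the gap grows linearly with the probe interval, so a rollout that moves every pod of a Gateway to other nodes can leave no healthy node for the whole window.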

Repro steps:
Put a service of type LoadBalancer in front of envoy. If needed, raise the external loadbalancer health check interval to 60 seconds to make the behavior easier to see. Then modify an HTTPRoute and watch the pods restart and move between kubernetes nodes; the services become unavailable for some seconds.
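
One possible way to measure the unavailability during the repro is a small poller like the sketch below. It is not part of envoy gateway; the function names, the polling period and the success criterion (any 2xx/3xx response) are assumptions made for illustration.

```python
import http.client
import time
import urllib.request
from typing import Iterable, List, Tuple


def outage_windows(samples: Iterable[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Collapse (timestamp, ok) samples into (start, end) windows of failures.

    A window starts at the first failed sample and ends at the next
    successful one; a trailing run of failures ends at its last sample.
    """
    windows: List[Tuple[float, float]] = []
    start = None
    last_ts = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            windows.append((start, ts))
            start = None
        last_ts = ts
    if start is not None:
        windows.append((start, last_ts))
    return windows


def probe(url: str, duration_s: float = 120.0, period_s: float = 1.0) -> List[Tuple[float, bool]]:
    """Poll `url` roughly once per `period_s` and return (timestamp, ok) samples."""
    samples: List[Tuple[float, bool]] = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        ts = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=period_s) as resp:
                ok = 200 <= resp.status < 400
        except (OSError, http.client.HTTPException):
            # Covers URLError/HTTPError, timeouts and connection resets.
            ok = False
        samples.append((ts, ok))
        time.sleep(period_s)
    return samples
```

Running `outage_windows(probe(url))` against the external loadbalancer address while editing an HTTPRoute should list the periods in which requests failed, if the behavior described here occurs.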

Instead of restarting pods, the envoy configuration should be reloaded. The kubernetes deployment itself should not be modified on every configuration change: each rollout causes downtime when an external loadbalancer is used with externalTrafficPolicy: Local, because the health checks are not fast enough to follow the pods.

Environment:
eg 1.0.0
