bug: race condition? when using mergeGateway and multiple gateways #2968

@zetaab

Description:

I have the following installation:

% kubectl get gateway -A             
NAMESPACE              NAME       CLASS         ADDRESS          PROGRAMMED   AGE
echoserver             foobar     eg-internal   10.222.156.49    True         34m
envoy-gateway-system   internal   eg-internal   10.222.156.49    True         50m

full yaml spec: https://gist.github.com/zetaab/8caa34f5072d5a8efc5c2425c331c561

and httproutes https://gist.github.com/zetaab/149545f3e0ae17c0b925bafd3512d1eb

When I add HTTPRoutes and the Envoy proxies restart, all services randomly become unavailable. When the Envoy pods start, I can see the following in the logs:

[2024-03-18 06:46:10.571][1][info][upstream] [source/common/listener_manager/lds_api.cc:99] lds: add/update listener 'envoy-gateway-system/internal/https'
[2024-03-18 06:46:10.573][1][info][upstream] [source/common/listener_manager/lds_api.cc:99] lds: add/update listener 'envoy-gateway-system/internal/http'

With this output, all services are working (including services attached to gateways other than the one that owns the https listener). However, when the log instead says

[2024-03-18 07:17:24.768][1][info][upstream] [source/common/listener_manager/lds_api.cc:99] lds: add/update listener 'echoserver/foobar/https-foo'
[2024-03-18 07:17:24.769][1][info][upstream] [source/common/listener_manager/lds_api.cc:99] lds: add/update listener 'envoy-gateway-system/internal/http'

nothing works:

 % curl https://foo.bar -v -k     
*   Trying 10.222.156.49:443...
* Connected to foo.bar (10.222.156.49) port 443
* ALPN: curl offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
* LibreSSL SSL_connect: SSL_ERROR_SYSCALL in connection to foo.bar:443 
* Closing connection
curl: (35) LibreSSL SSL_connect: SSL_ERROR_SYSCALL in connection to foo.bar:443
% curl https://eg-int.company.com -v
*   Trying 10.222.156.49:443...
* Connected to eg-int.company.com (10.222.156.49) port 443
* ALPN: curl offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/cert.pem
*  CApath: none
* LibreSSL SSL_connect: SSL_ERROR_SYSCALL in connection to eg-int.company.com:443 
* Closing connection
curl: (35) LibreSSL SSL_connect: SSL_ERROR_SYSCALL in connection to eg-int.company.com:443 

Why are the listeners sometimes loaded in a different order?
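A minimal sketch for spotting the broken state from the Envoy logs, assuming the log format shown above. It groups the "lds: add/update listener" lines by second and flags any group that lacks 'envoy-gateway-system/internal/https'. Treating that listener name as the marker of the working state is an assumption based only on the two log samples in this report, and the function names are illustrative.

```python
import re
from collections import defaultdict

# Pattern for the LDS lines quoted above; other Envoy log formats are not handled.
LDS_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\].*lds: add/update listener '(?P<name>[^']+)'"
)

# Assumption: this listener name marks the working state, as in the first log sample.
PRIMARY_HTTPS = "envoy-gateway-system/internal/https"

SAMPLE = """\
[2024-03-18 06:46:10.571][1][info][upstream] [source/common/listener_manager/lds_api.cc:99] lds: add/update listener 'envoy-gateway-system/internal/https'
[2024-03-18 06:46:10.573][1][info][upstream] [source/common/listener_manager/lds_api.cc:99] lds: add/update listener 'envoy-gateway-system/internal/http'
[2024-03-18 07:17:24.768][1][info][upstream] [source/common/listener_manager/lds_api.cc:99] lds: add/update listener 'echoserver/foobar/https-foo'
[2024-03-18 07:17:24.769][1][info][upstream] [source/common/listener_manager/lds_api.cc:99] lds: add/update listener 'envoy-gateway-system/internal/http'
"""


def listeners_by_second(log_text):
    """Group listener names by log timestamp truncated to the second."""
    groups = defaultdict(list)
    for line in log_text.splitlines():
        match = LDS_LINE.match(line.strip())
        if match:
            second = match.group("ts").split(".")[0]
            groups[second].append(match.group("name"))
    return dict(groups)


def suspicious_updates(groups, required=PRIMARY_HTTPS):
    """Return the timestamps whose listener group lacks the required name."""
    return sorted(ts for ts, names in groups.items() if required not in names)


groups = listeners_by_second(SAMPLE)
print(suspicious_updates(groups))  # prints ['2024-03-18 07:17:24']
```

Grouping by second is only a rough stand-in for one LDS update; it works for the samples here because both listeners of each update are logged within a few milliseconds.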

Repro steps:

  1. Enable mergeGateways.
  2. Create two different Gateways.
  3. Add and delete HTTPRoutes.
  4. Fairly soon a listener stops working for no apparent reason (no error is logged anywhere, but the port is simply not listening).
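Step 3 can be sketched as a small loop that repeatedly applies and deletes the HTTPRoute manifests. This is only an illustration: the file name httproutes.yaml is a hypothetical local copy of the routes from the gist, the number of rounds and the pause are arbitrary, and the script only prints the commands unless dry_run is set to False.

```python
import subprocess
import time

# Hypothetical local copy of the HTTPRoute manifests linked above.
ROUTES_FILE = "httproutes.yaml"


def churn_commands(routes_file, rounds):
    """Build the kubectl commands for `rounds` apply/delete cycles."""
    commands = []
    for _ in range(rounds):
        commands.append(["kubectl", "apply", "-f", routes_file])
        commands.append(["kubectl", "delete", "-f", routes_file])
    return commands


def run(commands, pause_seconds=10.0, dry_run=True):
    """Print the commands, or execute them with a pause after each one."""
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
            time.sleep(pause_seconds)


# Dry run by default: nothing is sent to the cluster.
run(churn_commands(ROUTES_FILE, rounds=3))
```

After each cycle, the listener state can be checked with curl or with the egctl command shown in the Logs section.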

Environment:
eg 1.0.0

Logs:

I took the listener configurations with egctl: egctl config envoy-proxy listener -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gatewayclass=eg-internal

not working:
https://gist.github.com/zetaab/2e0f2f00174d4b6189290e095ebe5cf5

working:
https://gist.github.com/zetaab/08d2f7b3bfbd04a28be14bd990552214

As can be seen, the non-working dump is sometimes missing 400+ lines of configuration. When this happens, I need to delete every Gateway other than the primary one (located in envoy-gateway-system) and then add the other Gateways back. After that, everything works again.
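A minimal sketch of how the two dumps can be compared line by line, using Python's difflib. It makes no assumption about the structure of the egctl output and treats both dumps as plain text; the inline strings are placeholders standing in for the contents of the two gists.

```python
import difflib


def missing_lines(working_text, broken_text):
    """Return lines of the working dump that have no counterpart in the broken one."""
    working = working_text.splitlines()
    broken = broken_text.splitlines()
    matcher = difflib.SequenceMatcher(a=working, b=broken, autojunk=False)
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" and "replace" both mean working[i1:i2] is absent from the broken dump.
        if tag in ("delete", "replace"):
            missing.extend(working[i1:i2])
    return missing


# Placeholder data; in practice, read the saved working and non-working dumps.
working_dump = "listener-a\nfilter-chain-1\nfilter-chain-2\nlistener-b\n"
broken_dump = "listener-a\nlistener-b\n"

lost = missing_lines(working_dump, broken_dump)
print(len(lost), "lines missing from the non-working dump")
```

Counting the returned lines gives a quick number to compare against the roughly 400 missing lines mentioned above, and printing them shows which part of the listener configuration was dropped.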
