bug: race condition? when using mergeGateway and multiple gateways #2968
Description:
I have the following installation:
% kubectl get gateway -A
NAMESPACE              NAME       CLASS         ADDRESS         PROGRAMMED   AGE
echoserver             foobar     eg-internal   10.222.156.49   True         34m
envoy-gateway-system   internal   eg-internal   10.222.156.49   True         50m
full yaml spec: https://gist.github.com/zetaab/8caa34f5072d5a8efc5c2425c331c561
and httproutes https://gist.github.com/zetaab/149545f3e0ae17c0b925bafd3512d1eb
When I add httproutes and the envoy proxies restart, all services randomly become unavailable. When the envoy pods start, I can see the following in the logs:
[2024-03-18 06:46:10.571][1][info][upstream] [source/common/listener_manager/lds_api.cc:99] lds: add/update listener 'envoy-gateway-system/internal/https'
[2024-03-18 06:46:10.573][1][info][upstream] [source/common/listener_manager/lds_api.cc:99] lds: add/update listener 'envoy-gateway-system/internal/http'
When the log looks like this, all services are working (including services that come from gateways other than the one owning the https listener). However, when the log says
[2024-03-18 07:17:24.768][1][info][upstream] [source/common/listener_manager/lds_api.cc:99] lds: add/update listener 'echoserver/foobar/https-foo'
[2024-03-18 07:17:24.769][1][info][upstream] [source/common/listener_manager/lds_api.cc:99] lds: add/update listener 'envoy-gateway-system/internal/http'
Nothing will work.
% curl https://foo.bar -v -k
* Trying 10.222.156.49:443...
* Connected to foo.bar (10.222.156.49) port 443
* ALPN: curl offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
* LibreSSL SSL_connect: SSL_ERROR_SYSCALL in connection to foo.bar:443
* Closing connection
curl: (35) LibreSSL SSL_connect: SSL_ERROR_SYSCALL in connection to foo.bar:443
% curl https://eg-int.company.com -v
* Trying 10.222.156.49:443...
* Connected to eg-int.company.com (10.222.156.49) port 443
* ALPN: curl offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
* CAfile: /etc/ssl/cert.pem
* CApath: none
* LibreSSL SSL_connect: SSL_ERROR_SYSCALL in connection to eg-int.company.com:443
* Closing connection
curl: (35) LibreSSL SSL_connect: SSL_ERROR_SYSCALL in connection to eg-int.company.com:443
Why are the listeners sometimes loaded in a different order?
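To spot the failing state from the proxy logs, something like the following could be used. It is only a minimal sketch: it assumes the log lines have the `lds: add/update listener '...'` form shown above, and that the two listeners from the working log are the ones that should always be present. The function names are made up for illustration.

```python
import re

# Matches Envoy LDS log lines of the form shown in this report.
LDS_RE = re.compile(r"lds: add/update listener '([^']+)'")


def listeners_in_log(lines):
    """Return listener names in the order they appear in the log lines."""
    names = []
    for line in lines:
        match = LDS_RE.search(line)
        if match:
            names.append(match.group(1))
    return names


def missing_listeners(lines, expected):
    """Return the expected listener names that never appear in the log lines."""
    seen = set(listeners_in_log(lines))
    return [name for name in expected if name not in seen]


# Assumption: these are the listeners seen in the working log above.
EXPECTED = [
    "envoy-gateway-system/internal/https",
    "envoy-gateway-system/internal/http",
]

if __name__ == "__main__":
    # The two log lines from the failing start.
    bad_log = [
        "[2024-03-18 07:17:24.768][1][info][upstream] "
        "[source/common/listener_manager/lds_api.cc:99] "
        "lds: add/update listener 'echoserver/foobar/https-foo'",
        "[2024-03-18 07:17:24.769][1][info][upstream] "
        "[source/common/listener_manager/lds_api.cc:99] "
        "lds: add/update listener 'envoy-gateway-system/internal/http'",
    ]
    print(missing_listeners(bad_log, EXPECTED))
```

With the failing log, this reports `envoy-gateway-system/internal/https` as missing, which matches the port 443 failures in the curl output.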
Repro steps:
- use mergeGateways
- create two different gateways
- add and delete httproutes
- fairly soon you should see a listener fail for some reason (there is no error anywhere, but the port is simply not listening)
Environment:
eg 1.0.0
Logs:
I took the listener configurations using egctl: egctl config envoy-proxy listener -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gatewayclass=eg-internal
not working:
https://gist.github.com/zetaab/2e0f2f00174d4b6189290e095ebe5cf5
working:
https://gist.github.com/zetaab/08d2f7b3bfbd04a28be14bd990552214
As can be seen, the configuration is sometimes missing 400+ lines. When this happens, I need to delete all gateways other than the primary one (located in envoy-gateway-system) and then add the other gateways back. After that everything starts to work again.
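To compare the two dumps without reading through hundreds of lines, a small script could list which listener names are present in the working dump but absent from the broken one. This is a rough sketch under assumptions: that both dumps were saved as JSON, and that listeners appear under Envoy's `dynamic_listeners` key with a `name` field, wherever that key is nested. The file names in the usage comment are hypothetical.

```python
import json


def collect_listener_names(node):
    """Recursively collect 'name' values found under any 'dynamic_listeners' list."""
    names = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "dynamic_listeners" and isinstance(value, list):
                for entry in value:
                    if isinstance(entry, dict) and "name" in entry:
                        names.add(entry["name"])
            else:
                names |= collect_listener_names(value)
    elif isinstance(node, list):
        for item in node:
            names |= collect_listener_names(item)
    return names


def missing_in_broken(working, broken):
    """Return sorted listener names present in `working` but not in `broken`."""
    return sorted(collect_listener_names(working) - collect_listener_names(broken))


def load_dump(path):
    """Load a saved config dump from a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Usage (hypothetical file names for the two saved dumps):
#   print(missing_in_broken(load_dump("working.json"), load_dump("broken.json")))
```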