Gateway API TLS termination fails permanently if API server is temporarily unreachable during operator startup #43130
Is there an existing issue for this?
- I have searched the existing issues
Related issue: #32596 (closed as stale without resolution)
What happened?
When Cilium operator starts and the Kubernetes API server is temporarily unreachable (even for a few seconds), the Gateway API controller fails to initialize and never recovers. This results in:
- TLS secrets not being synchronized to the `cilium-secrets` namespace
- Envoy proxy starting with 0 TLS secrets
- All HTTPS traffic through Gateway API failing with "Connection reset by peer" during the TLS handshake
- HTTP traffic continuing to work (redirects to HTTPS)
- Gateway resources showing `Programmed: True` even though TLS termination is broken
Expected behavior: The Cilium operator should retry Gateway API initialization, or use a watch-based approach that recovers once the API server becomes available.
Actual behavior: Gateway API controller initialization fails once and is never retried. The operator continues running, but Gateway API secret sync is permanently broken until the operator pod is manually restarted.
How can we reproduce the issue?
- Set up a Kubernetes cluster with Cilium and Gateway API enabled
- Create a Gateway with TLS termination using cert-manager certificates
- Verify HTTPS works correctly
- Simulate API server unavailability during operator restart:
- Either restart operator while API server is under load/briefly unavailable
- Or in a single-node cluster, reboot the node (race condition between components starting)
- After the operator starts, check:
  - `kubectl get secrets -n cilium-secrets` will be empty
  - `kubectl logs -n kube-system deployment/cilium-operator | grep -i gateway` will show the error
  - HTTPS requests to the Gateway will fail with TLS handshake errors
Evidence from logs
Cilium operator logs showing the failure:
level=info msg="Checking for required and optional GatewayAPI resources" requiredGVK="[gateway.networking.k8s.io/v1, Kind=gatewayclasses ...]"
level=error msg="Required GatewayAPI resources are not found, please refer to docs for installation instructions" error="Get \"https://172.16.101.101:6443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/gatewayclasses.gateway.networking.k8s.io\": dial tcp 172.16.101.101:6443: connect: no route to host"
level=info msg=Invoked duration=21.507194468s function="gateway-api.initGatewayAPIController"
Secret sync registration missing Gateway (compare with working state):
# Broken state - Gateway missing from registrations:
level=info msg="Setting up Secret synchronization" registrations="[*v2.CiliumClusterwideNetworkPolicy -> \"cilium-secrets\" *v2.CiliumNetworkPolicy -> \"cilium-secrets\"]"
# Working state (from issue #32596 comment by @ivucica):
level=info msg="Setting up Secret synchronization" registrations="[*v1.Ingress -> \"cilium-secrets\" *v2.CiliumNetworkPolicy -> \"cilium-secrets\" *v2.CiliumClusterwideNetworkPolicy -> \"cilium-secrets\" *v1.Gateway -> \"cilium-secrets\"]"
Cilium agent logs showing Envoy failing to get secrets:
level=info msg="[loading 0 static secret(s)" subsys=envoy-config
level=info msg="[gRPC config: initial fetch timed out for type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret" subsys=envoy-config
level=info msg="[gRPC config: initial fetch timed out for type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret" subsys=envoy-config
level=info msg="[gRPC config: initial fetch timed out for type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret" subsys=envoy-config
Gateway shows healthy but TLS is broken:
$ kubectl get gateway -A
NAMESPACE NAME CLASS ADDRESS PROGRAMMED
kube-system cilium-gateway-internal cilium 172.16.100.250 True
$ curl -v https://myapp.home.example.com/
* TLS handshake, Client hello
* Recv failure: Connection reset by peer
curl: (35) Recv failure: Connection reset by peer
Root Cause Analysis
Looking at the code flow:
- `initGatewayAPIController` in `pkg/gateway-api/cell.go` checks for Gateway API CRDs at startup
- If the API server is unreachable, this check fails with a network error
- The error is logged but the function returns, and the Gateway API controller is never initialized
- Secret sync for Gateway resources is never registered
- There is no retry mechanism or recovery path
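The flow above can be illustrated with a minimal, self-contained sketch of the failure mode. This is not Cilium's actual code; `checkCRDs`, `initGatewayAPIController`, and `apiServerUp` are hypothetical stand-ins for the one-shot CRD discovery and its surrounding init:

```go
package main

import (
	"errors"
	"fmt"
)

// apiServerUp simulates whether the API server is reachable.
var apiServerUp = false

// checkCRDs stands in for the one-shot CRD discovery call; the real code
// performs a GET against the API server, which fails with a network error
// while the server is unreachable.
func checkCRDs() error {
	if !apiServerUp {
		return errors.New("dial tcp 172.16.101.101:6443: connect: no route to host")
	}
	return nil
}

// initGatewayAPIController mirrors the reported flow: on error it logs and
// returns, so secret sync for Gateway resources is never registered and
// nothing ever re-runs this function.
func initGatewayAPIController() bool {
	if err := checkCRDs(); err != nil {
		fmt.Println("Required GatewayAPI resources are not found:", err)
		return false // controller never initialized, never retried
	}
	return true
}

func main() {
	ok := initGatewayAPIController()
	fmt.Println("gateway controller initialized:", ok)
	// The API server comes back a moment later, but since nothing retries
	// the init, the controller stays uninitialized for the pod's lifetime.
	apiServerUp = true
	fmt.Println("gateway controller initialized:", ok) // still false
}
```

The key point is that the error is consumed at startup: once `initGatewayAPIController` returns, the transient network failure has been converted into a permanent loss of Gateway secret sync.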
Suggested Fix
Options to consider:
- Retry loop: Add retry with exponential backoff for Gateway API CRD discovery
- Watch-based initialization: Use informers/watches that automatically handle reconnection
- Deferred initialization: Initialize Gateway API controller lazily when first Gateway resource is seen
- Health check integration: Report degraded status when Gateway API initialization fails, allowing orchestrators to restart the pod
Cilium Version
Client: 1.18.4 afda2aa9 2025-11-12T10:14:04+00:00 go version go1.24.10 linux/arm64
Daemon: 1.18.4 afda2aa9 2025-11-12T10:14:04+00:00 go version go1.24.10 linux/arm64
Kernel Version
6.17.0-1004-raspi (Ubuntu 25.10, Raspberry Pi ARM64)
Kubernetes Version
Client Version: v1.34.2
Server Version: v1.34.1+k3s1
Regression
Unknown. This behavior appears to exist in at least:
- v1.15.4, v1.15.5, v1.16.0-pre.2 (from issue #32596, "Gateway API stops working after restart of single node 'cluster'")
- v1.18.4 (this report)
- v1.19.0-pre.2 (from @ivucica's comment on issue #32596)
Environment
- Platform: K3s on Raspberry Pi 4 (ARM64)
- Gateway API version: v1.3.0
- TLS certificates managed by cert-manager
- Control plane HA via keepalived VIP (172.16.101.101)
Workaround
Manually restart the cilium-operator deployment after ensuring the API server is fully available:
kubectl rollout restart deployment/cilium-operator -n kube-system
Cilium Users Document
- Are you a user of Cilium? Please add yourself to the Users doc
Code of Conduct
- I agree to follow this project's Code of Conduct