Gateway API TLS termination fails permanently if API server is temporarily unreachable during operator startup #43130

@lexfrei

Description

Is there an existing issue for this?

  • I have searched the existing issues

Related issue: #32596 (closed as stale without resolution)

What happened?

When the Cilium operator starts while the Kubernetes API server is temporarily unreachable (even for a few seconds), the Gateway API controller fails to initialize and never recovers. This results in:

  1. TLS secrets not being synchronized to cilium-secrets namespace
  2. Envoy proxy starting with 0 TLS secrets
  3. All HTTPS traffic through Gateway API failing with "Connection reset by peer" during TLS handshake
  4. HTTP traffic continues to work (redirects to HTTPS)
  5. Gateway resources show Programmed: True even though TLS termination is broken

Expected behavior: The Cilium operator should retry Gateway API initialization, or use a watch-based approach that recovers once the API server becomes available.

Actual behavior: Gateway API controller initialization fails once and is never retried. The operator keeps running, but Gateway API secret sync stays broken until the operator pod is manually restarted.

How can we reproduce the issue?

  1. Set up a Kubernetes cluster with Cilium and Gateway API enabled
  2. Create a Gateway with TLS termination using cert-manager certificates
  3. Verify HTTPS works correctly
  4. Simulate API server unavailability during operator restart:
    • Either restart the operator while the API server is under load or briefly unavailable
    • Or, in a single-node cluster, reboot the node (this triggers a race between components starting)
  5. After the operator starts, check:
    • kubectl get secrets -n cilium-secrets: returns no secrets
    • kubectl logs -n kube-system deployment/cilium-operator | grep -i gateway: shows the error
    • HTTPS requests to the Gateway fail with TLS handshake errors

Evidence from logs

Cilium operator logs showing the failure:

level=info msg="Checking for required and optional GatewayAPI resources" requiredGVK="[gateway.networking.k8s.io/v1, Kind=gatewayclasses ...]"
level=error msg="Required GatewayAPI resources are not found, please refer to docs for installation instructions" error="Get \"https://172.16.101.101:6443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/gatewayclasses.gateway.networking.k8s.io\": dial tcp 172.16.101.101:6443: connect: no route to host"
level=info msg=Invoked duration=21.507194468s function="gateway-api.initGatewayAPIController"

Secret sync registration missing Gateway (compare with working state):

# Broken state - Gateway missing from registrations:
level=info msg="Setting up Secret synchronization" registrations="[*v2.CiliumClusterwideNetworkPolicy -> \"cilium-secrets\" *v2.CiliumNetworkPolicy -> \"cilium-secrets\"]"

# Working state (from issue #32596 comment by @ivucica):
level=info msg="Setting up Secret synchronization" registrations="[*v1.Ingress -> \"cilium-secrets\" *v2.CiliumNetworkPolicy -> \"cilium-secrets\" *v2.CiliumClusterwideNetworkPolicy -> \"cilium-secrets\" *v1.Gateway -> \"cilium-secrets\"]"

Cilium agent logs showing Envoy failing to get secrets:

level=info msg="[loading 0 static secret(s)" subsys=envoy-config
level=info msg="[gRPC config: initial fetch timed out for type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret" subsys=envoy-config
level=info msg="[gRPC config: initial fetch timed out for type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret" subsys=envoy-config
level=info msg="[gRPC config: initial fetch timed out for type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret" subsys=envoy-config

Gateway shows healthy but TLS is broken:

$ kubectl get gateway -A
NAMESPACE     NAME                      CLASS    ADDRESS          PROGRAMMED
kube-system   cilium-gateway-internal   cilium   172.16.100.250   True

$ curl -v https://myapp.home.example.com/
* TLS handshake, Client hello
* Recv failure: Connection reset by peer
curl: (35) Recv failure: Connection reset by peer

Root Cause Analysis

Looking at the code flow:

  1. initGatewayAPIController in pkg/gateway-api/cell.go checks for Gateway API CRDs at startup
  2. If API server is unreachable, this check fails with network error
  3. The error is logged (as "resources are not found", even though the underlying cause is a network error) and the function returns without initializing the Gateway API controller
  4. Secret sync for Gateway resources is never registered
  5. There is no retry mechanism or recovery path

Suggested Fix

Options to consider:

  1. Retry loop: Add retry with exponential backoff for Gateway API CRD discovery
  2. Watch-based initialization: Use informers/watches that automatically handle reconnection
  3. Deferred initialization: Initialize Gateway API controller lazily when first Gateway resource is seen
  4. Health check integration: Report degraded status when Gateway API initialization fails, allowing orchestrators to restart the pod

Cilium Version

Client: 1.18.4 afda2aa9 2025-11-12T10:14:04+00:00 go version go1.24.10 linux/arm64
Daemon: 1.18.4 afda2aa9 2025-11-12T10:14:04+00:00 go version go1.24.10 linux/arm64

Kernel Version

6.17.0-1004-raspi (Ubuntu 25.10, Raspberry Pi ARM64)

Kubernetes Version

Client Version: v1.34.2
Server Version: v1.34.1+k3s1

Regression

Unknown. This behavior appears to exist in at least:

Environment

  • Platform: K3s on Raspberry Pi 4 (ARM64)
  • Gateway API version: v1.3.0
  • TLS certificates managed by cert-manager
  • Control plane HA via keepalived VIP (172.16.101.101)

Workaround

Manually restart the cilium-operator deployment after confirming that the API server is fully available:

kubectl rollout restart deployment/cilium-operator -n kube-system

Metadata

Assignees

No one assigned

Labels

  • area/agent: Cilium agent related.
  • area/servicemesh: GH issues or PRs regarding servicemesh
  • feature/k8s-gateway-api
  • kind/regression: This functionality worked fine before, but was broken in a newer release of Cilium.
