# Gateway API TLS termination fails permanently if API server is temporarily unreachable during operator startup

## Is there an existing issue for this?

- [x] I have searched the existing issues

**Related issue**: #32596 (closed as stale without resolution)

## What happened?

When the Cilium operator starts while the Kubernetes API server is temporarily unreachable (even for a few seconds), the Gateway API controller fails to initialize and **never recovers**. The consequences are:

1. TLS secrets are not synchronized to the `cilium-secrets` namespace
2. The Envoy proxy starts with 0 TLS secrets
3. All HTTPS traffic through Gateway API fails with "Connection reset by peer" during the TLS handshake
4. HTTP traffic continues to work (it redirects to HTTPS)
5. Gateway resources show `Programmed: True` even though TLS termination is broken

**Expected behavior**: The Cilium operator should retry Gateway API initialization, or use a watch-based approach that recovers once the API server becomes available.

**Actual behavior**: Gateway API controller initialization fails once and is never retried. The operator keeps running, but Gateway API secret sync stays broken until the operator pod is manually restarted.

## How can we reproduce the issue?

1. Set up a Kubernetes cluster with Cilium and Gateway API enabled
2. Create a Gateway with TLS termination using cert-manager certificates
3. Verify HTTPS works correctly
4. Simulate API server unavailability during an operator restart:
   - Either restart the operator while the API server is under load or briefly unavailable
   - Or, in a single-node cluster, reboot the node (this creates a race between the components as they start)
5. After the operator starts, check:
   - `kubectl get secrets -n cilium-secrets` returns no secrets
   - `kubectl logs -n kube-system deployment/cilium-operator | grep -i gateway` shows the error
   - HTTPS requests to the Gateway fail with TLS handshake errors

## Evidence from logs

**Cilium operator logs showing the failure:**
```
level=info msg="Checking for required and optional GatewayAPI resources" requiredGVK="[gateway.networking.k8s.io/v1, Kind=gatewayclasses ...]"
level=error msg="Required GatewayAPI resources are not found, please refer to docs for installation instructions" error="Get \"https://172.16.101.101:6443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/gatewayclasses.gateway.networking.k8s.io\": dial tcp 172.16.101.101:6443: connect: no route to host"
level=info msg=Invoked duration=21.507194468s function="gateway-api.initGatewayAPIController"
```

**Secret sync registration missing Gateway (compare with working state):**
```
# Broken state - Gateway missing from registrations:
level=info msg="Setting up Secret synchronization" registrations="[*v2.CiliumClusterwideNetworkPolicy -> \"cilium-secrets\" *v2.CiliumNetworkPolicy -> \"cilium-secrets\"]"

# Working state (from issue #32596 comment by @ivucica):
level=info msg="Setting up Secret synchronization" registrations="[*v1.Ingress -> \"cilium-secrets\" *v2.CiliumNetworkPolicy -> \"cilium-secrets\" *v2.CiliumClusterwideNetworkPolicy -> \"cilium-secrets\" *v1.Gateway -> \"cilium-secrets\"]"
```
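The broken state can also be detected by parsing the registration line instead of reading it by eye. The following is a minimal Go sketch, not part of Cilium: `registeredKinds` and `gatewayRegistered` are hypothetical helper names, and the parsing assumes the exact log layout shown above.

```go
package main

import (
	"fmt"
	"strings"
)

// registeredKinds returns the resource kinds listed in the registrations
// field of a "Setting up Secret synchronization" log line.
// It assumes the exact layout shown in the logs above; it is not a general
// logfmt parser.
func registeredKinds(logLine string) []string {
	const key = `registrations="[`
	start := strings.Index(logLine, key)
	if start < 0 {
		return nil
	}
	rest := logLine[start+len(key):]
	end := strings.Index(rest, `]"`)
	if end < 0 {
		return nil
	}
	var kinds []string
	// Entries look like: *v1.Gateway -> \"cilium-secrets\"
	for _, field := range strings.Fields(rest[:end]) {
		if strings.HasPrefix(field, "*") {
			kinds = append(kinds, strings.TrimPrefix(field, "*"))
		}
	}
	return kinds
}

// gatewayRegistered reports whether v1.Gateway is among the registered kinds.
func gatewayRegistered(logLine string) bool {
	for _, kind := range registeredKinds(logLine) {
		if kind == "v1.Gateway" {
			return true
		}
	}
	return false
}

func main() {
	// Both lines are copied from the logs above.
	broken := `level=info msg="Setting up Secret synchronization" registrations="[*v2.CiliumClusterwideNetworkPolicy -> \"cilium-secrets\" *v2.CiliumNetworkPolicy -> \"cilium-secrets\"]"`
	working := `level=info msg="Setting up Secret synchronization" registrations="[*v1.Ingress -> \"cilium-secrets\" *v2.CiliumNetworkPolicy -> \"cilium-secrets\" *v2.CiliumClusterwideNetworkPolicy -> \"cilium-secrets\" *v1.Gateway -> \"cilium-secrets\"]"`

	fmt.Println("broken:", gatewayRegistered(broken))   // broken: false
	fmt.Println("working:", gatewayRegistered(working)) // working: true
}
```

A check like this could run against the operator log after every restart to flag the condition before users notice failing HTTPS requests.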

**Cilium agent logs showing Envoy failing to get secrets:**
```
level=info msg="[loading 0 static secret(s)" subsys=envoy-config
level=info msg="[gRPC config: initial fetch timed out for type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret" subsys=envoy-config
level=info msg="[gRPC config: initial fetch timed out for type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret" subsys=envoy-config
level=info msg="[gRPC config: initial fetch timed out for type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret" subsys=envoy-config
```

**Gateway shows healthy but TLS is broken:**
```
$ kubectl get gateway -A
NAMESPACE     NAME                      CLASS    ADDRESS          PROGRAMMED
kube-system   cilium-gateway-internal   cilium   172.16.100.250   True

$ curl -v https://myapp.home.example.com/
* TLS handshake, Client hello
* Recv failure: Connection reset by peer
curl: (35) Recv failure: Connection reset by peer
```

## Root Cause Analysis

Looking at the code flow:
1. `initGatewayAPIController` in `operator/pkg/gateway-api/cell.go` checks for the Gateway API CRDs once, at startup
2. If the API server is unreachable, this check fails with a network error
3. The error is logged as "Required GatewayAPI resources are not found", the function returns, and the Gateway API controller is never initialized
4. As a result, secret sync for Gateway resources is never registered
5. There is no retry mechanism or recovery path
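The operator log reports the CRDs as "not found" even though the underlying error is a network failure, which suggests that both cases take the same path. A minimal Go sketch of telling them apart follows. It is not Cilium's code: `errCRDNotFound`, `initOutcome`, `classify`, and `exampleUnreachableErr` are hypothetical names, and real code would more likely test the API error with `IsNotFound` from `k8s.io/apimachinery/pkg/api/errors`.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// errCRDNotFound is a hypothetical sentinel standing in for a 404 response
// to the CRD lookup (the CRDs are genuinely not installed).
var errCRDNotFound = errors.New("CRD not found")

// initOutcome is what the caller should do after the CRD check.
type initOutcome string

const (
	outcomeReady    initOutcome = "ready"    // CRDs present: start the controller
	outcomeDisabled initOutcome = "disabled" // CRDs missing: leave the feature off
	outcomeRetry    initOutcome = "retry"    // could not tell: check again later
)

// classify maps the CRD check result to an outcome. Anything that is not a
// confirmed "not found" is treated as transient, so a short API server outage
// cannot permanently disable the feature.
func classify(err error) initOutcome {
	switch {
	case err == nil:
		return outcomeReady
	case errors.Is(err, errCRDNotFound):
		return outcomeDisabled
	default:
		return outcomeRetry
	}
}

// exampleUnreachableErr builds an error shaped like the one in the operator
// log: a wrapped dial failure. The message text is illustrative.
func exampleUnreachableErr() error {
	dialErr := &net.OpError{Op: "dial", Net: "tcp", Err: errors.New("connect: no route to host")}
	return fmt.Errorf("get CRD: %w", dialErr)
}

func main() {
	fmt.Println(classify(nil))                                      // ready
	fmt.Println(classify(fmt.Errorf("lookup: %w", errCRDNotFound))) // disabled
	fmt.Println(classify(exampleUnreachableErr()))                  // retry
}
```

With this split, only a confirmed missing CRD would end initialization, while an unreachable API server would lead to another attempt.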

## Suggested Fix

Options to consider:
1. **Retry loop**: Add retry with exponential backoff for Gateway API CRD discovery
2. **Watch-based initialization**: Use informers/watches that automatically handle reconnection
3. **Deferred initialization**: Initialize Gateway API controller lazily when first Gateway resource is seen
4. **Health check integration**: Report degraded status when Gateway API initialization fails, allowing orchestrators to restart the pod
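Option 1 could look roughly like the following Go sketch. It uses only the standard library and makes several assumptions: `checkFunc` stands in for the CRD discovery call, `errPermanent` stands in for a confirmed "CRDs not installed" result, and `retryWithBackoff` and `demo` are hypothetical names. None of this is taken from the Cilium code base, and the delays are illustrative.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// checkFunc stands in for the Gateway API CRD discovery call (hypothetical).
type checkFunc func(ctx context.Context) error

// errPermanent marks failures that retrying cannot fix, such as CRDs that
// are confirmed to be absent (hypothetical sentinel).
var errPermanent = errors.New("permanent failure")

// retryWithBackoff calls check until it succeeds, fails permanently, or ctx
// is done. The delay doubles after each failed attempt, capped at maxDelay.
func retryWithBackoff(ctx context.Context, check checkFunc, initial, maxDelay time.Duration) error {
	delay := initial
	for {
		err := check(ctx)
		if err == nil || errors.Is(err, errPermanent) {
			return err
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("giving up: %w (last error: %v)", ctx.Err(), err)
		case <-time.After(delay):
		}
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}

// demo simulates an API server that is unreachable for the first `failures`
// attempts and returns how many attempts were made.
func demo(failures int) (int, error) {
	attempts := 0
	check := func(ctx context.Context) error {
		attempts++
		if attempts <= failures {
			return errors.New("dial tcp: no route to host")
		}
		return nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	err := retryWithBackoff(ctx, check, 10*time.Millisecond, 100*time.Millisecond)
	return attempts, err
}

func main() {
	attempts, err := demo(2)
	fmt.Printf("attempts=%d err=%v\n", attempts, err) // attempts=3 err=<nil>
}
```

In the operator, a loop like this would have to run in the background so that startup is not blocked, and the Gateway secret sync registration would have to happen after the check finally succeeds. Whether the current startup wiring allows late registration is not something this report has verified.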

## Cilium Version

```
Client: 1.18.4 afda2aa9 2025-11-12T10:14:04+00:00 go version go1.24.10 linux/arm64
Daemon: 1.18.4 afda2aa9 2025-11-12T10:14:04+00:00 go version go1.24.10 linux/arm64
```

## Kernel Version

```
6.17.0-1004-raspi (Ubuntu 25.10, Raspberry Pi ARM64)
```

## Kubernetes Version

```
Client Version: v1.34.2
Server Version: v1.34.1+k3s1
```

## Regression

Unknown. This behavior appears to exist in at least:
- v1.15.4, v1.15.5, v1.16.0-pre.2 (from issue #32596)
- v1.18.4 (this report)
- v1.19.0-pre.2 (from @ivucica comment on #32596)

## Environment

- Platform: K3s on Raspberry Pi 4 (ARM64)
- Gateway API version: v1.3.0
- TLS certificates managed by cert-manager
- Control plane HA via keepalived VIP (172.16.101.101)

## Workaround

Manually restart the `cilium-operator` deployment once the API server is fully available:
```bash
kubectl rollout restart deployment/cilium-operator -n kube-system
```

## Cilium Users Document

- [ ] Are you a user of Cilium? Please add yourself to the [Users doc](https://github.com/cilium/cilium/blob/main/USERS.md)

## Code of Conduct

- [x] I agree to follow this project's Code of Conduct

