fix(gateway-api): prevent silent Gateway API disable on CRD discovery timeout #44662

Open
aslafy-z wants to merge 1 commit into cilium:main from aslafy-z:fix/gtwapi-init


aslafy-z commented Mar 7, 2026

Summary

Fixes a race condition in Gateway API CRD discovery where the operator silently disables Gateway API instead of restarting when the API server is unreachable for longer than 30 seconds.

When the retry context expires during a checkCRDs call (as opposed to during bo.Wait), the returned error is a context deadline error that isTransientError does not recognize. The code falls through to the "permanent error" branch and returns {Enabled: false} with no error, which silently disables Gateway API and TLS secret synchronization while the operator continues running. HTTPS traffic through Gateway API then keeps failing, with no self-healing, until the operator is restarted manually.

Changes

  • Add ctx.Err() check before isTransientError in the retry loop to catch context expiry regardless of where it occurs
  • Improve bo.Wait timeout path with explicit logging and Health.Stopped() instead of a bare error return
  • Extract discoverCRDsWithRetry for testability (accepts a context parameter instead of hardcoding a 30s timeout)
  • Add 6 tests covering: success, CRDs not installed, transient retry then success, transient timeout, and the two race condition variants (context cancelled / deadline exceeded)
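
The ordering described in the first bullet can be illustrated with a minimal sketch. It is not the code from this PR: `discover`, `isTransient`, `result`, and both sentinel errors are hypothetical stand-ins for discoverCRDsWithRetry, isTransientError, and the operator's real types, and a plain timer stands in for bo.Wait. The point it shows is that checking ctx.Err() before classifying the error turns context expiry into a fatal error instead of a graceful disable.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// result is a hypothetical stand-in for the {Enabled: ...} value in the PR text.
type result struct{ Enabled bool }

// Hypothetical sentinel errors used only for this illustration.
var (
	errNotInstalled = errors.New("gateway API CRDs not installed")
	errUnavailable  = errors.New("API server unavailable")
)

// isTransient is a stand-in for isTransientError. Like the real helper
// described above, it does not recognize context errors.
func isTransient(err error) bool { return errors.Is(err, errUnavailable) }

// discover is a simplified stand-in for discoverCRDsWithRetry.
func discover(ctx context.Context, check func(context.Context) error, interval time.Duration) (result, error) {
	for {
		err := check(ctx)
		if err == nil {
			return result{Enabled: true}, nil
		}
		// Guard: if the context has expired, fail hard regardless of
		// which call observed the expiry.
		if ctxErr := ctx.Err(); ctxErr != nil {
			return result{}, fmt.Errorf("CRD discovery timed out: %w", ctxErr)
		}
		if !isTransient(err) {
			// Permanent error (for example, CRDs not installed): disable gracefully.
			return result{Enabled: false}, nil
		}
		// Transient error: wait before retrying (timer in place of bo.Wait).
		select {
		case <-ctx.Done():
			return result{}, fmt.Errorf("CRD discovery timed out: %w", ctx.Err())
		case <-time.After(interval):
		}
	}
}

// demoExpiredContext runs discover with an already-cancelled context.
func demoExpiredContext() (result, error) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	check := func(ctx context.Context) error { return ctx.Err() }
	return discover(ctx, check, time.Millisecond)
}

// demoNotInstalled runs discover with a check that fails permanently.
func demoNotInstalled() (result, error) {
	check := func(context.Context) error { return errNotInstalled }
	return discover(context.Background(), check, time.Millisecond)
}

func main() {
	res, err := demoExpiredContext()
	fmt.Println("expired context:", res.Enabled, err)
	res, err = demoNotInstalled()
	fmt.Println("CRDs not installed:", res.Enabled, err)
}
```

If the guard is removed from this sketch, the cancelled-context case falls into the permanent-error branch and returns a disabled result with a nil error, which mirrors the silent disable described in the summary.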

How it was validated

The new tests for the race condition (_ContextAlreadyCancelled, _ContextDeadlineExceeded) simulate calling discoverCRDsWithRetry with an already-expired context, which is the exact scenario that triggers the bug. Without the ctx.Err() guard, these tests fail because the function returns {Enabled: false} with a nil error instead of a fatal error. The _TransientErrorUntilTimeout test validates that the bo.Wait path also reports the correct health status and wraps the error message. Full go test ./operator/pkg/gateway-api/ suite passes.
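
For readers unfamiliar with how an already-expired context is built, here is a small sketch of the two variants using only the standard library. The helper names `cancelledContext`, `expiredContext`, and `kind` are hypothetical and do not come from the PR's test file; the actual tests may construct their contexts differently.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// cancelledContext returns a context that has already been cancelled.
func cancelledContext() context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return ctx
}

// expiredContext returns a context whose deadline is already in the past.
func expiredContext() context.Context {
	ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(-time.Second))
	// The context is already expired, so calling cancel only releases
	// resources; Err() keeps reporting context.DeadlineExceeded.
	cancel()
	return ctx
}

// kind names the state of a context based on ctx.Err().
func kind(ctx context.Context) string {
	switch err := ctx.Err(); {
	case err == nil:
		return "active"
	case errors.Is(err, context.Canceled):
		return "canceled"
	case errors.Is(err, context.DeadlineExceeded):
		return "deadline exceeded"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(kind(context.Background()))
	fmt.Println(kind(cancelledContext()))
	fmt.Println(kind(expiredContext()))
}
```

Both variants report a non-nil ctx.Err() before any work starts, so a function that checks ctx.Err() after a failed call sees the expiry immediately.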

Test plan

  • TestDiscoverCRDsWithRetry_Success - CRDs found on first try
  • TestDiscoverCRDsWithRetry_CRDsNotInstalled - permanent error, graceful disable
  • TestDiscoverCRDsWithRetry_TransientErrorThenSuccess - retry recovers
  • TestDiscoverCRDsWithRetry_TransientErrorUntilTimeout - timeout via bo.Wait returns fatal error with proper health status
  • TestDiscoverCRDsWithRetry_ContextAlreadyCancelled - race condition: pre-cancelled context returns fatal error, not silent disable
  • TestDiscoverCRDsWithRetry_ContextDeadlineExceeded - race condition: expired deadline returns fatal error, not silent disable

Fixes: #43130
Relates: #43452

@aslafy-z aslafy-z requested a review from a team as a code owner March 7, 2026 18:05
@aslafy-z aslafy-z requested a review from youngnick March 7, 2026 18:05
@maintainer-s-little-helper

Commit 8977bd5 does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

maintainer-s-little-helper bot added the dont-merge/needs-sign-off (the author needs to add signoff to their commits before merge) and dont-merge/needs-release-note-label (the author needs to describe the release impact of these changes) labels Mar 7, 2026
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Mar 7, 2026
… timeout

When the 30-second CRD discovery retry context expires during a checkCRDs
call (rather than during bo.Wait), the returned context deadline error is
not recognized by isTransientError. This causes the code to fall into the
"permanent error" path, silently disabling Gateway API while the operator
continues running — leaving TLS termination permanently broken with no
self-healing.

Add a ctx.Err() check after checkCRDs returns an error, before checking
isTransientError, to ensure context expiry always results in a fatal
error that crashes the operator. Kubelet then restarts the pod, retrying
Gateway API initialization on the fresh start.

Extract discoverCRDsWithRetry into its own function for testability and
add tests covering the race condition, transient retry, and timeout
scenarios.

Fixes: cilium#43130
Relates: cilium#43452

Signed-off-by: Zadkiel AHARONIAN <hello@zadkiel.fr>
@youngnick
Contributor

Thanks for this PR @aslafy-z. Could you please clarify if you've used AI assistance with this PR? I find that https://danielmiessler.com/blog/ai-influence-level-ail is a good way to be specific.

@julianwiedmann julianwiedmann added the need-more-info More information is required to further debug or fix the issue. label Mar 16, 2026
Labels

dont-merge/needs-release-note-label: The author needs to describe the release impact of these changes.
kind/community-contribution: This was a contribution made by a community member.
need-more-info: More information is required to further debug or fix the issue.

Development

Successfully merging this pull request may close these issues.

Gateway API TLS termination fails permanently if API server is temporarily unreachable during operator startup

3 participants