fix(gateway-api): prevent silent Gateway API disable on CRD discovery timeout#44662
Open
aslafy-z wants to merge 1 commit intocilium:mainfrom
Open
fix(gateway-api): prevent silent Gateway API disable on CRD discovery timeout#44662aslafy-z wants to merge 1 commit intocilium:mainfrom
aslafy-z wants to merge 1 commit intocilium:mainfrom
Conversation
|
Commit 8977bd5 does not match "(?m)^Signed-off-by:". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin |
… timeout When the 30-second CRD discovery retry context expires during a checkCRDs call (rather than during bo.Wait), the returned context deadline error is not recognized by isTransientError. This causes the code to fall into the "permanent error" path, silently disabling Gateway API while the operator continues running — leaving TLS termination permanently broken with no self-healing. Add a ctx.Err() check after checkCRDs returns an error, before checking isTransientError, to ensure context expiry always results in a fatal error that crashes the operator. Kubelet then restarts the pod, retrying Gateway API initialization on the fresh start. Extract discoverCRDsWithRetry into its own function for testability and add tests covering the race condition, transient retry, and timeout scenarios. Fixes: cilium#43130 Relates: cilium#43452 Signed-off-by: Zadkiel AHARONIAN <hello@zadkiel.fr>
8977bd5 to
76e7a5a
Compare
3 tasks
Contributor
|
Thanks for this PR @aslafy-z. Could you please clarify if you've used AI assistance with this PR? I find that https://danielmiessler.com/blog/ai-influence-level-ail is a good way to be specific. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a race condition in Gateway API CRD discovery where the operator silently disables Gateway API instead of restarting when the API server is unreachable for longer than 30 seconds.
When the retry context expires during a
checkCRDscall (as opposed to duringbo.Wait), the returned error is a context deadline error thatisTransientErrordoes not recognize. The code falls through to the "permanent error" branch, returning{Enabled: false}with no error, silently disabling Gateway API and TLS secret synchronization while the operator continues running. HTTPS traffic through Gateway API fails permanently with no self-healing until manual restart.Changes
ctx.Err()check beforeisTransientErrorin the retry loop to catch context expiry regardless of where it occursbo.Waittimeout path with explicit logging andHealth.Stopped()instead of a bare error returndiscoverCRDsWithRetryfor testability (accepts a context parameter instead of hardcoding a 30s timeout)How it was validated
The new tests for the race condition (
_ContextAlreadyCancelled,_ContextDeadlineExceeded) simulate callingdiscoverCRDsWithRetrywith an already-expired context, which is the exact scenario that triggers the bug. Without thectx.Err()guard, these tests fail because the function returns{Enabled: false}with a nil error instead of a fatal error. The_TransientErrorUntilTimeouttest validates that thebo.Waitpath also reports the correct health status and wraps the error message. Fullgo test ./operator/pkg/gateway-api/suite passes.Test plan
TestDiscoverCRDsWithRetry_Success- CRDs found on first tryTestDiscoverCRDsWithRetry_CRDsNotInstalled- permanent error, graceful disableTestDiscoverCRDsWithRetry_TransientErrorThenSuccess- retry recoversTestDiscoverCRDsWithRetry_TransientErrorUntilTimeout- timeout via bo.Wait returns fatal error with proper health statusTestDiscoverCRDsWithRetry_ContextAlreadyCancelled- race condition: pre-cancelled context returns fatal error, not silent disableTestDiscoverCRDsWithRetry_ContextDeadlineExceeded- race condition: expired deadline returns fatal error, not silent disableFixes: #43130
Relates: #43452