fix(gateway-api): add retry with backoff for CRD discovery on transient errors #43452
Merged
julianwiedmann merged 1 commit into cilium:main on Dec 23, 2025
mhofstetter (Member) requested changes on Dec 22, 2025
Thanks a lot for the contribution!
I left some suggestions inline - my main concern is that we don't handle initGatewayAPIController & registerSecretSync the same. Please let me know what you think.
mhofstetter (Member) requested changes on Dec 22, 2025
Thanks for the extraction of the preconditions. We still have to set a timeout for the context that is used in the retry.
Force-pushed from 1801002 to 5bc4bb5
Commit 9cfeb6e does not match "(?m)^Signed-off-by:". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin
Add retry with exponential backoff for Gateway API CRD discovery to handle transient API server errors during operator startup. This prevents permanent failure of Gateway API TLS termination when API server is temporarily unreachable.

Changes:
- Extract preconditions into private hive component (cell.ProvidePrivate)
- Add isTransientError() to distinguish retryable vs permanent errors
- Configure backoff with 200ms-5s range and 30s context timeout
- Report health status during retry attempts
- Ensure consistent behavior between initGatewayAPIController and registerSecretSync by sharing preconditions

Fixes: #43130

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Force-pushed from 9cfeb6e to 0170f19
Member
/test
aslafy-z added a commit to aslafy-z/cilium that referenced this pull request on Mar 7, 2026
… timeout

When the 30-second CRD discovery retry context expires during a checkCRDs call (rather than during bo.Wait), the returned context deadline error is not recognized by isTransientError. This causes the code to fall into the "permanent error" path, silently disabling Gateway API while the operator continues running, leaving TLS termination permanently broken with no self-healing.

Add a ctx.Err() check after checkCRDs returns an error, before checking isTransientError, to ensure context expiry always results in a fatal error that crashes the operator. Kubelet then restarts the pod, retrying Gateway API initialization on the fresh start.

Extract discoverCRDsWithRetry into its own function for testability and add tests covering the race condition, transient retry, and timeout scenarios.

Fixes: cilium#43130
Relates: cilium#43452
Signed-off-by: Zadkiel AHARONIAN <hello@zadkiel.fr>
Description
When cilium-operator starts and the Kubernetes API server is temporarily unreachable, Gateway API CRD discovery fails and never recovers. This results in permanent TLS termination failure because TLS secrets are not synchronized to the cilium-secrets namespace.

This PR adds retry with exponential backoff for Gateway API CRD discovery:
cell.Health for observability

Fixes: #43130