fix(gateway-api): add retry with backoff for CRD discovery on transient errors #43452

Merged: julianwiedmann merged 1 commit into cilium:main from lexfrei:fix/gateway-api-retry-on-transient-errors on Dec 23, 2025

lexfrei (Contributor) commented on Dec 19, 2025

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • If your commit description contains a Fixes: <commit-id> tag, then
    please add the commit author[s] as reviewer[s] to this issue.
  • All commits are signed off. See the section Developer's Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Are you a user of Cilium? Please add yourself to the Users doc
  • Thanks for contributing!

Description

When cilium-operator starts and the Kubernetes API server is temporarily unreachable, Gateway API CRD discovery fails and never recovers. This results in permanent TLS termination failure because TLS secrets are not synchronized to the cilium-secrets namespace.

This PR adds retry with exponential backoff for Gateway API CRD discovery:

  • Error classification: Distinguishes transient errors (network issues, API server overload) from permanent errors (CRDs not installed)
  • Retry loop: Exponential backoff (1s-2min) for transient errors only
  • Health reporting: Reports controller status via cell.Health for observability
  • Graceful degradation: Permanent errors (missing CRDs) exit immediately without infinite retries

Fixes: #43130

gateway-api: Add retry with exponential backoff for CRD discovery when API server is temporarily unreachable during operator startup

maintainer-s-little-helper bot added the dont-merge/needs-release-note-label label (The author needs to describe the release impact of these changes.) on Dec 19, 2025
github-actions bot added the kind/community-contribution label (This was a contribution made by a community member.) on Dec 19, 2025
lexfrei marked this pull request as ready for review on December 19, 2025 21:54
lexfrei requested a review from a team as a code owner on December 19, 2025 21:54
lexfrei requested a review from mhofstetter on December 19, 2025 21:54

mhofstetter (Member) left a comment

Thanks a lot for the contribution!

I left some suggestions inline - my main concern is that we don't handle initGatewayAPIController & registerSecretSync the same. Please let me know what you think.

mhofstetter added the kind/enhancement, release-note/minor, area/servicemesh, and feature/k8s-gateway-api labels on Dec 22, 2025
maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label label (The author needs to describe the release impact of these changes.) on Dec 22, 2025
lexfrei requested a review from mhofstetter on December 22, 2025 11:16

mhofstetter (Member) left a comment

Thanks for the extraction of the preconditions. We still have to set a timeout for the context that is used in the retry.

lexfrei requested a review from mhofstetter on December 22, 2025 13:17
lexfrei force-pushed the fix/gateway-api-retry-on-transient-errors branch from 1801002 to 5bc4bb5 on December 22, 2025 13:30

maintainer-s-little-helper commented

Commit 9cfeb6e does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

maintainer-s-little-helper bot added the dont-merge/needs-sign-off label (The author needs to add signoff to their commits before merge.) on Dec 22, 2025

Add retry with exponential backoff for Gateway API CRD discovery to handle
transient API server errors during operator startup. This prevents permanent
failure of Gateway API TLS termination when API server is temporarily
unreachable.

Changes:
- Extract preconditions into private hive component (cell.ProvidePrivate)
- Add isTransientError() to distinguish retryable vs permanent errors
- Configure backoff with 200ms-5s range and 30s context timeout
- Report health status during retry attempts
- Ensure consistent behavior between initGatewayAPIController and
  registerSecretSync by sharing preconditions

Fixes: #43130

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
lexfrei force-pushed the fix/gateway-api-retry-on-transient-errors branch from 9cfeb6e to 0170f19 on December 22, 2025 13:36
maintainer-s-little-helper bot removed the dont-merge/needs-sign-off label (The author needs to add signoff to their commits before merge.) on Dec 22, 2025

mhofstetter (Member) commented

/test

mhofstetter (Member) left a comment

LGTM - thanks again!

maintainer-s-little-helper bot added the ready-to-merge label (This PR has passed all tests and received consensus from code owners to merge.) on Dec 22, 2025
julianwiedmann added this pull request to the merge queue on Dec 23, 2025
Merged via the queue into cilium:main with commit 2c59ba7 Dec 23, 2025
76 checks passed
cilium-release-bot bot moved this to Released in cilium v1.19.0 on Feb 3, 2026
aslafy-z added a commit to aslafy-z/cilium that referenced this pull request Mar 7, 2026
… timeout

When the 30-second CRD discovery retry context expires during a checkCRDs
call (rather than during bo.Wait), the returned context deadline error is
not recognized by isTransientError. This causes the code to fall into the
"permanent error" path, silently disabling Gateway API while the operator
continues running — leaving TLS termination permanently broken with no
self-healing.

Add a ctx.Err() check after checkCRDs returns an error, before checking
isTransientError, to ensure context expiry always results in a fatal
error that crashes the operator. Kubelet then restarts the pod, retrying
Gateway API initialization on the fresh start.

Extract discoverCRDsWithRetry into its own function for testability and
add tests covering the race condition, transient retry, and timeout
scenarios.

Fixes: cilium#43130
Relates: cilium#43452

Signed-off-by: Zadkiel AHARONIAN <hello@zadkiel.fr>

Labels

  • area/servicemesh: GH issues or PRs regarding servicemesh
  • feature/k8s-gateway-api
  • kind/community-contribution: This was a contribution made by a community member.
  • kind/enhancement: This would improve or streamline existing functionality.
  • ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
  • release-note/minor: This PR changes functionality that users may find relevant to operating Cilium.

Projects

No open projects
Status: Released

Development

Successfully merging this pull request may close these issues.

Gateway API TLS termination fails permanently if API server is temporarily unreachable during operator startup

3 participants