
fix: fail fast when unrecoverable discovery errors happens on checking optional CRDs#7872

Merged: zhaohuabing merged 20 commits into envoyproxy:main from zhaohuabing:fix-7871 on Jan 15, 2026

Conversation

@zhaohuabing
Member

@zhaohuabing zhaohuabing commented Jan 7, 2026

What type of PR is this?

This PR adds retries to the controller when it fails to discover optional CRDs from the API server. If all retries fail, the error is propagated and causes the EG pod to restart. This prevents the EG pod from reconciling incomplete resources and serving partial xDS configuration to Envoy.
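
The retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `crdCheck`, `checkWithRetry`, `errTransient`, and `errForbidden` are hypothetical names, and a real controller would classify errors from the API server rather than compare against sentinel values.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient and errForbidden are hypothetical sentinel errors used only to
// illustrate the split between retryable and non-retryable failures.
var (
	errTransient = errors.New("transient discovery error")
	errForbidden = errors.New("forbidden")
)

// crdCheck is a hypothetical stand-in for a discovery lookup that reports
// whether an optional CRD is served by the API server.
type crdCheck func() (bool, error)

// checkWithRetry calls check up to attempts times, doubling the wait between
// tries. Only errTransient is retried; any other error, or running out of
// attempts, is returned to the caller instead of being read as "CRD absent".
func checkWithRetry(check crdCheck, attempts int, wait time.Duration) (bool, error) {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		found, err := check()
		if err == nil {
			return found, nil
		}
		if !errors.Is(err, errTransient) {
			return false, fmt.Errorf("unrecoverable discovery error: %w", err)
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(wait)
			wait *= 2
		}
	}
	return false, fmt.Errorf("discovery failed after %d attempts: %w", attempts, lastErr)
}
```

The point of returning the error, instead of `false, nil`, is that the caller can tell "the CRD is not installed" apart from "the lookup failed" and stop before reconciling an incomplete set of resources.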

It also propagates runner startup errors to the server, so the Envoy Gateway process can exit and restart cleanly. Previously, runner startup failures were only logged, and Envoy Gateway continued running even with failed runners.
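
As a rough sketch of the difference between logging a startup failure and propagating it, consider the following. The `runner` type, `startAll`, and `errDiscovery` are hypothetical and much simpler than the actual Envoy Gateway runner interfaces; the example only shows the first failure being returned to the caller.

```go
package main

import (
	"errors"
	"fmt"
)

// errDiscovery is a hypothetical startup failure used for illustration.
var errDiscovery = errors.New("discovery failed")

// runner is a hypothetical, simplified view of a component with a start step.
type runner struct {
	name  string
	start func() error
}

// startAll starts runners in order and returns the first startup error,
// wrapped with the runner name, rather than logging it and carrying on.
func startAll(runners []runner) error {
	for _, r := range runners {
		if err := r.start(); err != nil {
			return fmt.Errorf("failed to start %s runner: %w", r.name, err)
		}
	}
	return nil
}
```

A server entry point that receives a non-nil error from a function like `startAll` can exit with a non-zero status, which lets Kubernetes restart the pod instead of leaving a partially started process running.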

Fixes #7871

Release Notes: Yes

@zhaohuabing zhaohuabing requested a review from a team as a code owner January 7, 2026 02:16
@zhaohuabing zhaohuabing marked this pull request as draft January 7, 2026 02:16
@zhaohuabing zhaohuabing changed the title fail fast when unrecoverable discovery errors happens fix: fail fast when unrecoverable discovery errors happens Jan 7, 2026
@zhaohuabing zhaohuabing force-pushed the fix-7871 branch 2 times, most recently from 36e3fe1 to d4af0fb Compare January 7, 2026 02:25
@codecov

codecov bot commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 48.57143% with 54 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.74%. Comparing base (3fd3e4a) to head (29071df).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/provider/kubernetes/controller.go | 56.79% | 17 Missing and 18 partials ⚠️ |
| internal/cmd/server.go | 0.00% | 10 Missing ⚠️ |
| ...nternal/envoygateway/config/loader/configloader.go | 22.22% | 7 Missing ⚠️ |
| internal/provider/kubernetes/controller_watch.go | 60.00% | 1 Missing and 1 partial ⚠️ |

❌ Your patch status has failed because the patch coverage (48.57%) is below the target coverage (60.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7872      +/-   ##
==========================================
- Coverage   72.80%   72.74%   -0.07%     
==========================================
  Files         235      235              
  Lines       35313    35380      +67     
==========================================
+ Hits        25709    25736      +27     
- Misses       7781     7806      +25     
- Partials     1823     1838      +15     


@zhaohuabing zhaohuabing changed the title fix: fail fast when unrecoverable discovery errors happens fix: fail fast when unrecoverable discovery errors happens on checking optional CRDs Jan 7, 2026
Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>
Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>
@zhaohuabing zhaohuabing force-pushed the fix-7871 branch 3 times, most recently from ff6bbad to 0e6c3a9 Compare January 7, 2026 07:14
@zhaohuabing zhaohuabing marked this pull request as ready for review January 7, 2026 07:17
Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>
@netlify

netlify bot commented Jan 8, 2026

Deploy Preview for cerulean-figolla-1f9435 ready!

Name Link
🔨 Latest commit 29071df
🔍 Latest deploy log https://app.netlify.com/projects/cerulean-figolla-1f9435/deploys/69675027f84f4d00088fe9d9
😎 Deploy Preview https://deploy-preview-7872--cerulean-figolla-1f9435.netlify.app

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>
@zhaohuabing zhaohuabing force-pushed the fix-7871 branch 2 times, most recently from 273f904 to f96ea29 Compare January 8, 2026 02:25
Signed-off-by: Huabing(Robin) Zhao <zhaohuabing@gmail.com>
@zhaohuabing zhaohuabing marked this pull request as ready for review January 12, 2026 11:28
@zhaohuabing zhaohuabing requested a review from arkodg January 12, 2026 23:55
@zhaohuabing zhaohuabing added this to the v1.7.0-rc.1 Release milestone Jan 13, 2026
arkodg
arkodg previously approved these changes Jan 14, 2026
Contributor

@arkodg arkodg left a comment


LGTM thanks

@arkodg arkodg requested review from a team January 14, 2026 05:14
zirain
zirain previously approved these changes Jan 14, 2026
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
@zhaohuabing zhaohuabing merged commit 09b6456 into envoyproxy:main Jan 15, 2026
56 of 59 checks passed
@zhaohuabing zhaohuabing deleted the fix-7871 branch January 15, 2026 01:15
andreik-n2 pushed a commit to andreik-n2/gateway that referenced this pull request Jan 15, 2026
…g optional CRDs (envoyproxy#7872)

* fail fast when unrecoverable discovery errors happens

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* only retry transient errors

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* fix potenial dead lock

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* address comments

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* minor wording

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* create discovery client once

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* fix lint

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* address comments

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* remove redundant logging

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* add e2e test

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* fix test

Signed-off-by: Huabing(Robin) Zhao <zhaohuabing@gmail.com>

* fix test

Signed-off-by: Huabing(Robin) Zhao <zhaohuabing@gmail.com>

---------

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
@zirain
Member

zirain commented Jan 15, 2026

FYI, while testing #7964, it took more than 60s before the runner returned an error for a discovery failure.

@zhaohuabing
Member Author

> FYI, while testing #7964, it took more than 60s before the runner returned an error for a discovery failure.

Hi @zirain, is this because of the retries in this PR?

@zirain
Member

zirain commented Jan 23, 2026

> > FYI, while testing #7964, it took more than 60s before the runner returned an error for a discovery failure.
>
> Hi @zirain, is this because of the retries in this PR?

I'm not 100% sure. Maybe we need a way to disable the retry in test code?
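
One possible shape for such a switch, sketched under assumptions: the retry settings are passed in as a value, so test code can supply a single-attempt policy. `retryPolicy`, `productionPolicy`, `testPolicy`, and `errUnavailable` are hypothetical names, and the numbers are placeholders, not the values used by this PR.

```go
package main

import (
	"errors"
	"time"
)

// errUnavailable is a hypothetical failure used for illustration.
var errUnavailable = errors.New("api server unavailable")

// retryPolicy is a hypothetical knob for a discovery retry loop.
type retryPolicy struct {
	attempts int
	wait     time.Duration
}

// Placeholder values; they are not taken from the PR.
var (
	productionPolicy = retryPolicy{attempts: 5, wait: time.Second}
	testPolicy       = retryPolicy{attempts: 1} // one try, no waiting
)

// run calls op until it succeeds or the attempts are used up, sleeping
// between tries, and returns the last error seen.
func (p retryPolicy) run(op func() error) error {
	n := p.attempts
	if n < 1 {
		n = 1
	}
	var err error
	for i := 0; i < n; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < n-1 {
			time.Sleep(p.wait)
		}
	}
	return err
}
```

With this layout, a test that expects a discovery failure could use `testPolicy` and see the error after the first attempt rather than waiting for the full backoff sequence.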

zirain pushed a commit to zirain/gateway that referenced this pull request Jan 26, 2026
…g optional CRDs (envoyproxy#7872)

rudrakhp pushed a commit to rudrakhp/gateway that referenced this pull request Jan 26, 2026
…g optional CRDs (envoyproxy#7872)

Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>
zirain added a commit that referenced this pull request Jan 26, 2026
* fix: fail fast when unrecoverable discovery errors happens on checking optional CRDs (#7872)

* fail fast when unrecoverable discovery errors happens

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* only retry transient errors

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* fix potenial dead lock

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* address comments

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* minor wording

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* create discovery client once

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* fix lint

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* address comments

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* remove redundant logging

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* add e2e test

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* fix test

Signed-off-by: Huabing(Robin) Zhao <zhaohuabing@gmail.com>

* fix test

Signed-off-by: Huabing(Robin) Zhao <zhaohuabing@gmail.com>

---------

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* fix: extproc is discarded with failOpen is enabled for wasm (#7956)

* fix: extproc is discarded with failOpen is enabled for wasm

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* add test

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* polish code

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* add test

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

---------

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* fix: sanitize control plane config dump (#7901)

* mask secrets

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* address comments

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

---------

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* fix: server run race (#7964)

* add test

Signed-off-by: zirain <zirain2009@gmail.com>

* fix race

Signed-off-by: zirain <zirain2009@gmail.com>

* fix lint

Signed-off-by: zirain <zirain2009@gmail.com>

* fix

Signed-off-by: zirain <zirain2009@gmail.com>

* fix

Signed-off-by: zirain <zirain2009@gmail.com>

* fix lint

Signed-off-by: zirain <zirain2009@gmail.com>

* use Semaphore instead of WaitGroup

Signed-off-by: zirain <zirain2009@gmail.com>

* comments

Signed-off-by: zirain <zirain2009@gmail.com>

* lint

Signed-off-by: zirain <zirain2009@gmail.com>

* fix

Signed-off-by: zirain <zirain2009@gmail.com>

* fix lint

Signed-off-by: zirain <zirain2009@gmail.com>

* callback

Signed-off-by: zirain <zirain2009@gmail.com>

* fix lint

Signed-off-by: zirain <zirain2009@gmail.com>

* run hook sequentially

Signed-off-by: zirain <zirain2009@gmail.com>

* fix lint

Signed-off-by: zirain <zirain2009@gmail.com>

* rename to cfgMux

Signed-off-by: zirain <zirain2009@gmail.com>

---------

Signed-off-by: zirain <zirain2009@gmail.com>

* fix: wrong cluster type with mixed FQDN backend and service backend refs (#7994)

* fix: wrong cluster type with mixed FQDN backend and service backend refs

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* fix mirror cluster endpoint type

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* simplify the test

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* update comment

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

---------

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* fix: merge route match rule with match all route (#8011)

Signed-off-by: zirain <zirain2009@gmail.com>

* fix gen

Signed-off-by: zirain <zirain2009@gmail.com>

* fix lint

Signed-off-by: zirain <zirain2009@gmail.com>

* fix for golang 11.24

Signed-off-by: zirain <zirain2009@gmail.com>

* fix lint

Signed-off-by: zirain <zirain2009@gmail.com>

* fix watch CRD version

Signed-off-by: zirain <zirain2009@gmail.com>

---------

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Signed-off-by: zirain <zirain2009@gmail.com>
Co-authored-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
rudrakhp added a commit that referenced this pull request Jan 26, 2026
* fix: extproc is discarded with failOpen is enabled for wasm (#7956)

* fix: extproc is discarded with failOpen is enabled for wasm

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* add test

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* polish code

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* add test

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

---------

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>

* fix: sanitize control plane config dump (#7901)

* mask secrets

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* address comments

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

---------

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>

* fix: server run race (#7964)

* add test

Signed-off-by: zirain <zirain2009@gmail.com>

* fix race

Signed-off-by: zirain <zirain2009@gmail.com>

* fix lint

Signed-off-by: zirain <zirain2009@gmail.com>

* fix

Signed-off-by: zirain <zirain2009@gmail.com>

* fix

Signed-off-by: zirain <zirain2009@gmail.com>

* fix lint

Signed-off-by: zirain <zirain2009@gmail.com>

* use Semaphore instead of WaitGroup

Signed-off-by: zirain <zirain2009@gmail.com>

* comments

Signed-off-by: zirain <zirain2009@gmail.com>

* lint

Signed-off-by: zirain <zirain2009@gmail.com>

* fix

Signed-off-by: zirain <zirain2009@gmail.com>

* fix lint

Signed-off-by: zirain <zirain2009@gmail.com>

* callback

Signed-off-by: zirain <zirain2009@gmail.com>

* fix lint

Signed-off-by: zirain <zirain2009@gmail.com>

* run hook sequentially

Signed-off-by: zirain <zirain2009@gmail.com>

* fix lint

Signed-off-by: zirain <zirain2009@gmail.com>

* rename to cfgMux

Signed-off-by: zirain <zirain2009@gmail.com>

---------

Signed-off-by: zirain <zirain2009@gmail.com>
Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>

* fix: wrong cluster type with mixed FQDN backend and service backend refs (#7994)

* fix: wrong cluster type with mixed FQDN backend and service backend refs

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* fix mirror cluster endpoint type

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* simplify the test

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

* update comment

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>

---------

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>

* fix: fail fast when unrecoverable discovery errors happens on checking optional CRDs (#7872)

* fail fast when unrecoverable discovery errors happens

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* only retry transient errors

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* fix potenial dead lock

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* address comments

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* minor wording

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* create discovery client once

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* fix lint

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* address comments

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* remove redundant logging

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* add e2e test

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* fix test

Signed-off-by: Huabing(Robin) Zhao <zhaohuabing@gmail.com>

* fix test

Signed-off-by: Huabing(Robin) Zhao <zhaohuabing@gmail.com>

---------

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>

* fix: merge route match rule with match all route (#8011)

Signed-off-by: zirain <zirain2009@gmail.com>
Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>

* fix: do not set autoHTTPConfig when used mixed(HTTP + HTTPS) backends (#7950)

* fix: do not set autoHTTPConfig when used mixed backend

Signed-off-by: zirain <zirain2009@gmail.com>

* release notes

Signed-off-by: zirain <zirain2009@gmail.com>

* fix

Signed-off-by: zirain <zirain2009@gmail.com>

* add e2e

Signed-off-by: zirain <zirain2009@gmail.com>

---------

Signed-off-by: zirain <zirain2009@gmail.com>
Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>

* fix: backend tls default namespace (#7987)

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>

* fix: race in gatewaapi runner (#8037)

* add testcase

Signed-off-by: zirain <zirain2009@gmail.com>

* fix

Signed-off-by: zirain <zirain2009@gmail.com>

* simply

Signed-off-by: zirain <zirain2009@gmail.com>

---------

Signed-off-by: zirain <zirain2009@gmail.com>
Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>

* [release/v1.6] v1.6.3 release notes (#8054)

Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>

* v1.6.3 version

Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>

* fix gen-check

Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>

* fix lint

Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>

---------

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>
Signed-off-by: zirain <zirain2009@gmail.com>
Co-authored-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Co-authored-by: zirain <zirain2009@gmail.com>
SadmiB pushed a commit to SadmiB/gateway that referenced this pull request Jan 30, 2026
…g optional CRDs (envoyproxy#7872)

Signed-off-by: Sadmi Bouhafs <sadmibouhafs@gmail.com>
Development

Successfully merging this pull request may close these issues.

Optional CRDs skipped when discovery errors are treated as “absent”

4 participants