Context cancelled not treated as transient, causing unintended Envoy-Proxy recreation #6849
Description:
We encountered an unexpected recreation of the Envoy-Proxy deployment, which caused traffic disruption. After investigating, we found that a network issue triggered Envoy-Gateway to recreate the Envoy-Proxy deployment.
The logs indicate that while processing Gateways (`processGateways`), the system received a context cancelled error. This error was not handled as a transient error, so the remaining logic incorrectly assumed that zero Gateways existed, which eventually led to the deletion of the Envoy-Proxy deployment.
Expected: These context errors (deadline exceeded or context cancelled) should be treated as transient, allowing retry.
Actual: They are treated as non-transient, which may cause unexpected failures, such as the issue we experienced and the one reported here.
Repro steps:
- Send a request with a context that has a tight timeout, or manually cancel the request context.
- Observe that the call returns `deadline exceeded` or `context canceled`.
- The system does not classify this as a transient error and continues with a corrupted view of the cluster state.
Environment:
EG v1.4.2
Logs:
An error correctly marked as a transient error:
2025-08-25T19:34:24.009Z ERROR provider kubernetes/controller.go:295 transient error processing gateways {"runner": "provider", "gatewayClass": "eg", "error": "failed to list : etcdserver: leader changed"}
Examples of errors not marked as transient:
2025-08-25T14:47:35.097Z ERROR provider kubernetes/controller.go:298 failed processGateways for gatewayClass eg, skipping it {"runner": "provider", "error": "failed to list : client rate limiter Wait returned an error: context canceled"}
2025-08-25T14:47:31.632Z ERROR provider kubernetes/controller.go:298 failed processGateways for gatewayClass eg, skipping it {"runner": "provider", "error": "failed to list : context canceled"}