Description:
After upgrading from v1.3.2 to v1.4.0, we started to see filter_chain_not_found and cluster_not_found errors in the Envoy access logs.
The errors disappeared after a few hours.
They started with one specific xDS update and stopped after another update.
We found a single error log at exactly the time the errors started (filter chain not found):
"ERROR provider kubernetes/controller.go:237 failed processGateways for gatewayClass eg, skipping it {"runner": "provider", "error": "failed to list ***, Kind=***: etcdserver: leader changed"}"
We noticed #5953, which replaced all return reconcile.Result{}, err statements with continue statements in the Reconcile function.
This explains the behavior we observed: gateway resources are stored here regardless of errors during the reconcile process, which can lead to a partial state being stored, as in our case.
Furthermore, reconcile.Result{}, nil is always returned from the Reconcile function regardless of any errors in the process, which means the controller-runtime backoff mechanism is never triggered. In case of a transient error (e.g. an etcd error, as in our case), we have to wait for some watch event before Reconcile is called again.
Repro steps:
Produce some transient error in the middle of Reconcile process.
Environment:
EG v1.4.0