Ignore reconciliation errors caused by namespace termination #4176

agunnerson-elastic wants to merge 1 commit into external-secrets:main
Conversation
When namespace finalizers are used, it's pretty easy to hit a race condition where a namespace is deleted while the ClusterExternalSecrets and ExternalSecrets reconciliation loops are running. This commit excludes errors caused by namespace deletion from being logged or counting towards the error metrics since they are not actionable and are "expected".

Signed-off-by: Andrew Gunnerson <andrew.gunnerson@elastic.co>
```go
if apierrors.IsNotFound(updateErr) {
	return
}
```
How are we sure this not found relates to a deleted namespace?
That's a good point. Maybe this should just log an informational message instead (e.g., like what Kubernetes does for built-in resources like replicasets: https://github.com/kubernetes/kubernetes/blob/v1.31.3/pkg/controller/replicaset/replica_set.go#L687-L691)
this is different from the replicaset code you've cited - it is for an update error - it means we were able to get an ExternalSecret, and while performing an update, the ExternalSecret was deleted (possibly due to namespace failure).
This is the equivalent RS implementation https://github.com/kubernetes/kubernetes/blob/v1.31.3/pkg/controller/replicaset/replica_set.go#L728-L733
Hi @agunnerson-elastic ! Even though I added some comments on the review - I am not sure this is really something we should ignore. From the flow perspective:
For me ☝️ this is an error condition that we should monitor & requeue ASAP to get a new set of namespaces with the right labels (now with one namespace fewer). I get the whole ephemerality of the env you're running, but this is going to apply everywhere, even when namespaces are not ephemeral 😄 wouldn't you want to monitor this failure condition in that case? edit: can't we instead perform this filter for ns not marked as deletion on
Also, what's exactly the race condition here? From my reading, it is just another reconciliation loop for both CES and ES until things are cleared out, which is generating the errors. Is my understanding correct? Or is this a race condition between CES and ES?
Thanks for reviewing this, @gusfcarvalho!
That's fair. I suppose from my point of view, ideally, the behavior of ESO would be the same if a namespace is deleted when the reconciliation loop is not running vs. when it is running. This is where I took inspiration from for this approach:
A deployment will silently avoid creating a replicaset and a replicaset will silently avoid creating a pod if the namespace is terminating/terminated. It doesn't seem like clusterexternalsecret creating externalsecret is much different from these scenarios.
Unfortunately not. That's how I originally tried implementing it. We ran into issues where the namespace was still "alive" during
Sorry I was unclear. Yep, this is all I meant.
I still wish for something a bit more native and easier to understand. Perhaps we could just add finalizers to each namespace bound to ClusterExternalSecret, or finalizers to the ExternalSecret objects themselves. That ☝️ would prevent TOCTOU for sure. The only case that wouldn't be covered is a namespace that still didn't have an ExternalSecret object upon creation - which again I think we should error 😄. IMO there is a difference between erroring on a replicaset (which is bound to one namespace) and a CES (which targets several namespaces via a selector). The better analogy IMO is if pods with that specific label were removed and then the RS controller needed to figure out what to do 😄.
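The finalizer bookkeeping being suggested can be sketched stdlib-only as below. The finalizer name is hypothetical; a real controller would call `controllerutil.AddFinalizer` / `controllerutil.RemoveFinalizer` from controller-runtime on the object's `ObjectMeta.Finalizers`, which behave like these helpers.

```go
package main

import "fmt"

// cesFinalizer is a hypothetical finalizer key; the real controller would
// choose its own domain-qualified name.
const cesFinalizer = "external-secrets.io/cluster-external-secret"

// addFinalizer appends f to the slice unless it is already present,
// mirroring controllerutil.AddFinalizer's idempotent behavior.
func addFinalizer(finalizers []string, f string) []string {
	for _, existing := range finalizers {
		if existing == f {
			return finalizers
		}
	}
	return append(finalizers, f)
}

// removeFinalizer drops every occurrence of f, mirroring
// controllerutil.RemoveFinalizer. Once the slice is empty, the API server
// is free to complete deletion of the object.
func removeFinalizer(finalizers []string, f string) []string {
	out := make([]string, 0, len(finalizers))
	for _, existing := range finalizers {
		if existing != f {
			out = append(out, existing)
		}
	}
	return out
}

func main() {
	fins := addFinalizer(nil, cesFinalizer)
	fins = addFinalizer(fins, cesFinalizer) // idempotent: still one entry
	fmt.Println(len(fins))
	fins = removeFinalizer(fins, cesFinalizer)
	fmt.Println(len(fins))
}
```

The deletion ordering guarantee comes from the API server, not this code: an object with a non-empty finalizer list is only marked for deletion, and is removed once the controller clears its finalizer after cleanup.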
This pr is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.
hi @agunnerson-elastic ! were you able to check my point? Is this something you're still working on? |
Hey @gusfcarvalho! Sorry for disappearing and not replying sooner. I'll look into trying your approach of using finalizers to prevent TOCTOU issues in the coming months.
No worries! it's the nature of Open source! LMK how it goes 😄 |
I believe this PR will solve my issue. I'm defining a ClusterExternalSecret like this:

```yaml
apiVersion: external-secrets.io/v1
kind: ClusterExternalSecret
metadata:
  name: ces-name
  namespace: {{ .Release.Namespace }}
spec:
  namespaceSelectors:
    - matchLabels: {}
...
```

My intention is to deploy the same secret to every namespace. However, whenever a namespace is deleted, CES keeps trying to recreate the ExternalSecret in that terminating namespace, which prevents it from ever finishing termination. CES reports this error:

```yaml
status:
  ...
  failedNamespaces:
    - namespace: acme-ns
      reason: >-
        could not create or update ExternalSecret:
        externalsecrets.external-secrets.io "acme-secret" is
        forbidden: unable to create new content in namespace acme-ns because
        it is being terminated
```
Replace error suppression approach with proper finalizer-based solution to handle namespace deletion race conditions in ClusterExternalSecret and ExternalSecret controllers.

Changes:
- Add finalizers to ClusterExternalSecret for proper cleanup ordering
- Add finalizers to ExternalSecret objects for managed secret cleanup
- Add namespace finalizers to prevent deletion while CES is bound
- Check namespace existence and termination status before processing
- Remove error suppression logic (IsNamespaceGone function)

This approach prevents race conditions by ensuring resources are cleaned up in the correct order, rather than just hiding error symptoms.

Fixes the issues raised in external-secrets#4176

Signed-off-by: framsouza <fram.souza14@gmail.com>
```go
if !ctrlerrors.IsNamespaceGone(err) {
	log.Error(err, "failed to create or update external secret")
	failedNamespaces[namespace.Name] = err
}
```
nit: maybe it makes sense that if we do have a gone namespace, we log a debug message
```go
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

func IsNamespaceGone(err error) bool {
```
should this really be part of pkg or is this a purely internal function?
```go
if ctrlerrors.IsNamespaceGone(err) {
	err = nil
} else {
	r.markAsFailed(msgErrorUpdateSecret, err, externalSecret, syncCallsError.With(resourceLabels))
}
```
nit: clean up the code path, e.g.:

```go
if ctrlerrors.IsNamespaceGone(err) {
	return ctrl.Result{}, nil
}
r.markAsFailed(msgErrorUpdateSecret, err, externalSecret, syncCallsError.With(resourceLabels))
return ctrl.Result{}, err
```
```go
// Ensure that the 404 is for a namespace.
status, ok := err.(apierrors.APIStatus)
if (ok || errors.As(err, &status)) && status.Status().Details != nil {
```
this if condition broke my head :D why do you need to call `errors.As(err, &status)` here if we already type-asserted on it with `status, ok := err.(apierrors.APIStatus)`? Is there ever a case where `errors.As` succeeds when `ok` is false?
I think this PR can be closed in favor of #5154
@agunnerson-elastic are you okay with us closing this in favor of that?
closing as discussed. please reopen if needed later |



Problem Statement
When namespace finalizers are used, it's pretty easy to hit a race condition where a namespace is deleted while the ClusterExternalSecret and ExternalSecret reconciliation loops are running. When namespaces are regularly being created and deleted, this causes ESO to report a large number of errors, both in the logs and in the prometheus metrics, making it more difficult to identify and monitor actual issues.

Related Issue
(None reported)
Proposed Changes
This commit prevents errors caused by namespace deletion from being logged or counted towards the error metrics. These errors are "expected" to happen during namespace deletion and don't represent an issue with the service.
This is the same approach used in Kubernetes itself, for example in the Deployment, Job, and EndpointSlice controllers.

Checklist

- git commit --signoff
- make test
- make reviewable