
Ignore reconciliation errors caused by namespace termination #4176

Closed
agunnerson-elastic wants to merge 1 commit into external-secrets:main from agunnerson-elastic:namespace-deletion

Conversation

@agunnerson-elastic
Contributor

Problem Statement

When namespace finalizers are used, it's pretty easy to hit a race condition where a namespace is deleted while the ClusterExternalSecret and ExternalSecret reconciliation loops are running. When namespaces are regularly being created and deleted, this causes ESO to report a large number of errors, both in the logs and in the Prometheus metrics, making it harder to identify and monitor actual issues.

Related Issue

(None reported)

Proposed Changes

This commit prevents errors caused by namespace deletion from being logged or counting towards the error metrics. These errors are "expected" to happen during namespace deletion and don't represent an issue with the service.

This is the same approach used in Kubernetes itself, for example in the Deployment, Job, and EndpointSlice controllers.

Checklist

  • I have read the contribution guidelines
  • All commits are signed with git commit --signoff
  • My changes have reasonable test coverage
    • Happy to get some suggestions for this (if tests are needed for this change). Not sure if there's a good way to test for the presence or absence of a log message.
  • All tests pass with make test
  • I ensured my PR is ready for review with make reviewable

When namespace finalizers are used, it's pretty easy to hit a race
condition where a namespace is deleted while the ClusterExternalSecrets
and ExternalSecrets reconciliation loops are running. This commit
excludes errors caused by namespace deletion from being logged or
counting towards the error metrics since they are not actionable and are
"expected".

Signed-off-by: Andrew Gunnerson <andrew.gunnerson@elastic.co>
@agunnerson-elastic agunnerson-elastic requested a review from a team as a code owner December 4, 2024 20:20
@sonarqubecloud

sonarqubecloud bot commented Dec 4, 2024

Comment on lines +314 to +316
if apierrors.IsNotFound(updateErr) {
	return
}
Member


How are we sure this NotFound error relates to a deleted namespace?

Contributor Author


That's a good point. Maybe this should just log an informational message instead (e.g. like what Kubernetes does for built-in resources like replicasets: https://github.com/kubernetes/kubernetes/blob/v1.31.3/pkg/controller/replicaset/replica_set.go#L687-L691)

Member

@gusfcarvalho gusfcarvalho Dec 9, 2024


this is different from the replicaset code you've cited - it is for an updateErr: it means we were able to get an ExternalSecret, and while performing an update, the ExternalSecret was deleted (possibly due to the namespace being deleted).

This is the equivalent RS implementation https://github.com/kubernetes/kubernetes/blob/v1.31.3/pkg/controller/replicaset/replica_set.go#L728-L733

@gusfcarvalho
Member

gusfcarvalho commented Dec 5, 2024

Hi @agunnerson-elastic !

Even though I added some comments on the review - I am not sure this is really something we should ignore.

From the flow perspective:

  1. CES identifies X namespaces that are rightfully labeled to receive an ExternalSecret;
  2. A namespace gets deleted while CES is still creating the ExternalSecret;

For me ☝️ this is an error condition that we should monitor & Requeue ASAP to get a new set of namespaces with the right labels (now with one namespace less).

I get the whole ephemerality of the env you're running, but this is going to apply everywhere, even when namespaces are not ephemeral 😄 wouldn't you want to monitor this failure condition in that case?

edit: can't we instead filter out namespaces marked for deletion in getTargetNamespaces? This way the reconciler's target set would always be known and these errors would not occur due to race conditions at all.

@gusfcarvalho
Member

Also, what's exactly the race condition here?

From my reading, it is just another reconciliation loop for both CES and ES until things are cleared out, which is generating the errors. Is my understanding correct?

Or is this a race condition between CES and ES?

@agunnerson-elastic
Contributor Author

Thanks for reviewing this, @gusfcarvalho!

For me ☝️ this is an error condition that we should monitor & Requeue ASAP to get a new set of namespaces with the right labels (now with one namespace less).

That's fair. I suppose from my point of view, ideally, the behavior of ESO would be the same if a namespace is deleted when the reconciliation loop is not running vs. when it is running.

This is where I took inspiration from for this approach:

A deployment will silently avoid creating a replicaset and a replicaset will silently avoid creating a pod if the namespace is terminating/terminated. It doesn't seem like clusterexternalsecret creating externalsecret is much different from these scenarios.

edit: can't we instead perform this filter for ns not marked as deletion on getTargetNamespaces? This way the reconciler definition would always be known and these errors would not occur due to race conditions at all.

Unfortunately not. That's how I originally tried implementing it. We ran into issues where the namespace was still "alive" during getTargetNamespaces, but was already gone before createOrUpdateExternalSecret, which would fail. I think the only way to avoid TOCTOU issues is to check the errors after attempting to perform the operation.

From my reading, it is just another reconciliation loop for both CES and ES until things are cleared out, which is generating the errors. Is my understanding correct?

Sorry I was unclear. Yep, this is all I meant.

@gusfcarvalho
Member

gusfcarvalho commented Dec 9, 2024

I still wish for something a bit more native and easier to understand. Perhaps we could just add finalizers to each namespace bound to ClusterExternalSecret or finalizers to the ExternalSecret objects themselves.

That ☝️ would prevent TOCTOU for sure.

The only case that wouldn't be covered is a namespace that didn't yet have an ExternalSecret object upon creation - which, again, I think we should treat as an error 😄. IMO there is a difference between erroring on a replicaset (which is bound to one namespace) and a CES (which targets several namespaces via a selector). The better analogy IMO is if pods with that specific label were removed and then the RS controller needed to figure out what to do 😄.
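For illustration, the finalizer bookkeeping proposed above reduces to idempotent add/remove operations on an object's metadata.finalizers list (controller-runtime ships equivalents as controllerutil.AddFinalizer/RemoveFinalizer). A self-contained sketch, with a hypothetical finalizer name:

```go
package main

import "fmt"

// cesFinalizer is a hypothetical finalizer name for this sketch; the real
// project would pick its own identifier.
const cesFinalizer = "external-secrets.io/cluster-external-secret"

// addFinalizer appends f only if it is not already present, so repeated
// reconciles don't duplicate it.
func addFinalizer(finalizers []string, f string) []string {
	for _, existing := range finalizers {
		if existing == f {
			return finalizers
		}
	}
	return append(finalizers, f)
}

// removeFinalizer drops every occurrence of f; once the list is empty the
// API server is free to actually delete the object (or namespace).
func removeFinalizer(finalizers []string, f string) []string {
	out := finalizers[:0]
	for _, existing := range finalizers {
		if existing != f {
			out = append(out, existing)
		}
	}
	return out
}

func main() {
	fins := addFinalizer(nil, cesFinalizer)
	fmt.Println(fins)
	fins = removeFinalizer(fins, cesFinalizer)
	fmt.Println(len(fins))
}
```

The ordering guarantee comes from the API server: an object with a non-empty finalizers list is only marked for deletion, not removed, until the controller clears its entry.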

@github-actions

This pr is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the Stale This issue/Pull Request is stale and will be automatically closed label Mar 10, 2025
@gusfcarvalho
Member

gusfcarvalho commented Mar 24, 2025

hi @agunnerson-elastic ! were you able to check my point? Is this something you're still working on?

@github-actions github-actions bot removed the Stale This issue/Pull Request is stale and will be automatically closed label Mar 25, 2025
@agunnerson-elastic
Contributor Author

hi @agunnerson-elastic ! were you able to check my point? Is this something you're still working on?

Hey @gusfcarvalho! Sorry for disappearing and not replying sooner. I'll look into trying your approach of using finalizers to prevent TOCTOU issues in the coming months.

@gusfcarvalho
Member

No worries! it's the nature of Open source! LMK how it goes 😄

@carlos-ffs

I believe this PR will solve my issue. I’m defining a ClusterExternalSecret like this:

apiVersion: external-secrets.io/v1
kind: ClusterExternalSecret
metadata:
  name: ces-name
  namespace: {{ .Release.Namespace }}
spec:
  namespaceSelectors:
    - matchLabels: {}
...

My intention is to deploy the same secret to every namespace. However, whenever a namespace is deleted, CES keeps trying to recreate the ExternalSecret in that terminating namespace, which prevents it from ever finishing termination. CES reports this error:

status:
  ...
  failedNamespaces:
    - namespace: acme-ns
      reason: >-
        could not create or update ExternalSecret:
        externalsecrets.external-secrets.io "acme-secret" is
        forbidden: unable to create new content in namespace acme-ns because
        it is being terminated

framsouza pushed a commit to framsouza/external-secrets that referenced this pull request Aug 17, 2025
Replace error suppression approach with proper finalizer-based solution
to handle namespace deletion race conditions in ClusterExternalSecret
and ExternalSecret controllers.

Changes:
- Add finalizers to ClusterExternalSecret for proper cleanup ordering
- Add finalizers to ExternalSecret objects for managed secret cleanup
- Add namespace finalizers to prevent deletion while CES is bound
- Check namespace existence and termination status before processing
- Remove error suppression logic (IsNamespaceGone function)

This approach prevents race conditions by ensuring resources are cleaned
up in the correct order, rather than just hiding error symptoms.

Fixes the issues raised in external-secrets#4176

Signed-off-by: framsouza <fram.souza14@gmail.com>
Comment on lines +157 to +160
if !ctrlerrors.IsNamespaceGone(err) {
	log.Error(err, "failed to create or update external secret")
	failedNamespaces[namespace.Name] = err
}
Contributor


nit: maybe it makes sense that if we do hit a gone namespace, we log a debug message

apierrors "k8s.io/apimachinery/pkg/api/errors"
)

func IsNamespaceGone(err error) bool {
Contributor


should this really be part of pkg or is this a purely internal function?

Comment on lines +517 to +521
if ctrlerrors.IsNamespaceGone(err) {
	err = nil
} else {
	r.markAsFailed(msgErrorUpdateSecret, err, externalSecret, syncCallsError.With(resourceLabels))
}
Contributor


nit: clean up the code path like so:

if ctrlerrors.IsNamespaceGone(err) {
	return ctrl.Result{}, nil
}

r.markAsFailed(msgErrorUpdateSecret, err, externalSecret, syncCallsError.With(resourceLabels))
return ctrl.Result{}, err


// Ensure that the 404 is for a namespace.
status, ok := err.(apierrors.APIStatus)
if (ok || errors.As(err, &status)) && status.Status().Details != nil {
Contributor


this if condition broke my head :D why do you need to call errors.As(err, &status) here if we already type asserted with status, ok := err.(apierrors.APIStatus)? If ok is true, isn't errors.As always true as well?

@gusfcarvalho
Member

I think this PR can be closed in favor of #5154

@jakobmoellerdev
Contributor

@agunnerson-elastic are you okay with us closing this in favor of that?

@jakobmoellerdev
Contributor

Closing as discussed. Please reopen if needed later.
