
Ignore reconciliation errors caused by namespace termination #4176

Closed
agunnerson-elastic wants to merge 1 commit into external-secrets:main from agunnerson-elastic:namespace-deletion

Conversation

@agunnerson-elastic
Contributor

Problem Statement

When namespace finalizers are used, it's pretty easy to hit a race condition where a namespace is deleted while the ClusterExternalSecret and ExternalSecret reconciliation loops are running. When namespaces are regularly being created and deleted, this causes ESO to report a large number of errors, both in the logs and in the Prometheus metrics, making it harder to identify and monitor actual issues.

Related Issue

(None reported)

Proposed Changes

This commit prevents errors caused by namespace deletion from being logged or counting towards the error metrics. These errors are "expected" to happen during namespace deletion and don't represent an issue with the service.

This is the same approach used in Kubernetes itself, for example in the Deployment, Job, and EndpointSlice controllers.

Checklist

  • I have read the contribution guidelines
  • All commits are signed with git commit --signoff
  • My changes have reasonable test coverage
    • Happy to get some suggestions for this (if tests are needed for this change). Not sure if there's a good way to test for the presence or absence of a log message.
  • All tests pass with make test
  • I ensured my PR is ready for review with make reviewable

When namespace finalizers are used, it's pretty easy to hit a race
condition where a namespace is deleted while the ClusterExternalSecrets
and ExternalSecrets reconciliation loops are running. This commit
excludes errors caused by namespace deletion from being logged or
counting towards the error metrics since they are not actionable and are
"expected".

Signed-off-by: Andrew Gunnerson <andrew.gunnerson@elastic.co>
@agunnerson-elastic agunnerson-elastic requested a review from a team as a code owner December 4, 2024 20:20
@sonarqubecloud

sonarqubecloud bot commented Dec 4, 2024

Comment on lines +314 to +316
if apierrors.IsNotFound(updateErr) {
	return
}
Member


How are we sure this NotFound error relates to a deleted namespace?

Contributor Author


That's a good point. Maybe this should just log an informational message instead (e.g. like what Kubernetes does for built-in resources like replicasets: https://github.com/kubernetes/kubernetes/blob/v1.31.3/pkg/controller/replicaset/replica_set.go#L687-L691)

Member

@gusfcarvalho gusfcarvalho Dec 9, 2024


this is different from the replicaset code you've cited - it is for an updateErr: it means we were able to get an ExternalSecret, and while performing an update, the ExternalSecret was deleted (possibly due to the namespace being deleted).

This is the equivalent RS implementation https://github.com/kubernetes/kubernetes/blob/v1.31.3/pkg/controller/replicaset/replica_set.go#L728-L733

@gusfcarvalho
Member

gusfcarvalho commented Dec 5, 2024

Hi @agunnerson-elastic !

Even though I added some comments on the review - I am not sure this is really something we should ignore.

From the flow perspective:

  1. CES identifies X namespaces that are rightfully labeled to receive an ExternalSecret;
  2. A namespace gets deleted while CES is still creating the ExternalSecret;

For me ☝️ this is an error condition that we should monitor & Requeue ASAP to get a new set of namespaces with the right labels (now with one namespace less).

I get the whole ephemerality of the env you're running, but this is going to apply everywhere, even when namespaces are not ephemeral 😄 wouldn't you want to monitor this failure condition in that case?

edit: can't we instead filter out namespaces marked for deletion in getTargetNamespaces? This way the reconciler's target set would always be known and these errors would not occur due to race conditions at all.

@gusfcarvalho
Member

Also, what's exactly the race condition here?

From my reading, it is just another reconciliation loop for both CES and ES until things are cleared out, which is generating the errors. Is my understanding correct?

Or is this a race condition between CES and ES?

@agunnerson-elastic
Contributor Author

Thanks for reviewing this, @gusfcarvalho!

For me ☝️ this is an error condition that we should monitor & Requeue ASAP to get a new set of namespaces with the right labels (now with one namespace less).

That's fair. I suppose from my point of view, ideally, the behavior of ESO would be the same if a namespace is deleted when the reconciliation loop is not running vs. when it is running.

This is where I took inspiration from for this approach:

A deployment will silently avoid creating a replicaset and a replicaset will silently avoid creating a pod if the namespace is terminating/terminated. It doesn't seem like clusterexternalsecret creating externalsecret is much different from these scenarios.

edit: can't we instead perform this filter for ns not marked as deletion on getTargetNamespaces? This way the reconciler definition would always be known and these errors would not occur due to race conditions at all.

Unfortunately not. That's how I originally tried implementing it. We ran into issues where the namespace was still "alive" during getTargetNamespaces, but was already gone before createOrUpdateExternalSecret, which would fail. I think the only way to avoid TOCTOU issues is to check the errors after attempting to perform the operation.

From my reading, it is just another reconciliation loop for both CES and ES until things are cleared out, which is generating the errors. Is my understanding correct?

Sorry I was unclear. Yep, this is all I meant.

@gusfcarvalho
Member

gusfcarvalho commented Dec 9, 2024

I still wish for something a bit more native and easier to understand. Perhaps we could just add finalizers to each namespace bound to ClusterExternalSecret or finalizers to the ExternalSecret objects themselves.

That ☝️ would prevent TOCTOU for sure.

The only case that wouldn't be covered is a namespace that didn't yet have an ExternalSecret object upon creation - which, again, I think we should treat as an error 😄. IMO there is a difference between erroring on a replicaset (which is bound to one namespace) and a CES (which targets several namespaces via a selector). The better analogy IMO is if pods with that specific label were removed and then the RS controller needed to figure out what to do 😄.
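For illustration, the finalizer bookkeeping proposed above reduces to idempotent add/remove operations on an object's metadata.finalizers list (controller-runtime ships equivalents as controllerutil.AddFinalizer/RemoveFinalizer). A self-contained sketch, with a hypothetical finalizer name:

```go
package main

import "fmt"

// cesFinalizer is a hypothetical finalizer name for this sketch; the real
// project would pick its own identifier.
const cesFinalizer = "external-secrets.io/cluster-external-secret"

// addFinalizer appends f only if it is not already present, so repeated
// reconciles don't duplicate it.
func addFinalizer(finalizers []string, f string) []string {
	for _, existing := range finalizers {
		if existing == f {
			return finalizers
		}
	}
	return append(finalizers, f)
}

// removeFinalizer drops every occurrence of f; once the list is empty the
// API server is free to actually delete the object (or namespace).
func removeFinalizer(finalizers []string, f string) []string {
	out := finalizers[:0]
	for _, existing := range finalizers {
		if existing != f {
			out = append(out, existing)
		}
	}
	return out
}

func main() {
	fins := addFinalizer(nil, cesFinalizer)
	fmt.Println(fins)
	fins = removeFinalizer(fins, cesFinalizer)
	fmt.Println(len(fins))
}
```

The ordering guarantee comes from the API server: an object with a non-empty finalizers list is only marked for deletion, not removed, until the controller clears its entry.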

@github-actions

This pr is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the Stale This issue/Pull Request is stale and will be automatically closed label Mar 10, 2025
@gusfcarvalho
Member

gusfcarvalho commented Mar 24, 2025

hi @agunnerson-elastic ! were you able to check my point? Is this something you're still working on?

@github-actions github-actions bot removed the Stale This issue/Pull Request is stale and will be automatically closed label Mar 25, 2025
@agunnerson-elastic
Contributor Author

hi @agunnerson-elastic ! were you able to check my point? Is this something you're still working on?

Hey @gusfcarvalho! Sorry for disappearing and not replying sooner. I'll look into trying your approach of using finalizers to prevent TOCTOU issues in the coming months.

@gusfcarvalho
Member

No worries! it's the nature of Open source! LMK how it goes 😄

@carlos-ffs

I believe this PR will solve my issue. I’m defining a ClusterExternalSecret like this:

apiVersion: external-secrets.io/v1
kind: ClusterExternalSecret
metadata:
  name: ces-name
  namespace: {{ .Release.Namespace }}
spec:
  namespaceSelectors:
    - matchLabels: {}
...

My intention is to deploy the same secret to every namespace. However, whenever a namespace is deleted, CES keeps trying to recreate the ExternalSecret in that terminating namespace, which prevents it from ever finishing termination. CES reports this error:

status:
  ...
  failedNamespaces:
    - namespace: acme-ns
      reason: >-
        could not create or update ExternalSecret:
        externalsecrets.external-secrets.io "acme-secret" is
        forbidden: unable to create new content in namespace acme-ns because
        it is being terminated

framsouza pushed a commit to framsouza/external-secrets that referenced this pull request Aug 17, 2025
Replace error suppression approach with proper finalizer-based solution
to handle namespace deletion race conditions in ClusterExternalSecret
and ExternalSecret controllers.

Changes:
- Add finalizers to ClusterExternalSecret for proper cleanup ordering
- Add finalizers to ExternalSecret objects for managed secret cleanup
- Add namespace finalizers to prevent deletion while CES is bound
- Check namespace existence and termination status before processing
- Remove error suppression logic (IsNamespaceGone function)

This approach prevents race conditions by ensuring resources are cleaned
up in the correct order, rather than just hiding error symptoms.

Fixes the issues raised in external-secrets#4176

Signed-off-by: framsouza <fram.souza14@gmail.com>
Comment on lines +157 to +160
if !ctrlerrors.IsNamespaceGone(err) {
	log.Error(err, "failed to create or update external secret")
	failedNamespaces[namespace.Name] = err
}
Contributor


nit: maybe it makes sense that if we do hit a gone namespace, we log a debug message

apierrors "k8s.io/apimachinery/pkg/api/errors"
)

func IsNamespaceGone(err error) bool {
Contributor


should this really be part of pkg or is this a purely internal function?

Comment on lines +517 to +521
if ctrlerrors.IsNamespaceGone(err) {
	err = nil
} else {
	r.markAsFailed(msgErrorUpdateSecret, err, externalSecret, syncCallsError.With(resourceLabels))
}
Contributor


nit: clean up the code path like so:

if ctrlerrors.IsNamespaceGone(err) {
	return ctrl.Result{}, nil
}

r.markAsFailed(msgErrorUpdateSecret, err, externalSecret, syncCallsError.With(resourceLabels))
return ctrl.Result{}, err


// Ensure that the 404 is for a namespace.
status, ok := err.(apierrors.APIStatus)
if (ok || errors.As(err, &status)) && status.Status().Details != nil {
Contributor


this if condition broke my head :D why do you need to call errors.As(err, &status) here if we already type asserted with status, ok := err.(apierrors.APIStatus)? If ok is true, isn't errors.As always true as well?

@gusfcarvalho
Member

I think this PR can be closed in favor of #5154

@jakobmoellerdev
Contributor

@agunnerson-elastic are you okay with us closing this in favor of that?

@jakobmoellerdev
Contributor

Closing as discussed. Please reopen if needed later.
