Skip to content

feat: significantly reduce api calls and introduce partial secret cache#4086

Merged
Skarlso merged 9 commits intoexternal-secrets:mainfrom
thesuperzapper:improve-external-secrets-controller
Nov 24, 2024
Merged

feat: significantly reduce api calls and introduce partial secret cache#4086
Skarlso merged 9 commits intoexternal-secrets:mainfrom
thesuperzapper:improve-external-secrets-controller

Conversation

@thesuperzapper
Copy link
Copy Markdown
Contributor

@thesuperzapper thesuperzapper commented Nov 9, 2024

Problem Statement

This PR significantly improves the performance and stability of the ExternalSecrets controller.

While there are many changes (see below), I want to summarize the goals of this PR:

  1. Reduce the number of Kubernetes API calls made by the ExternalSecrets controller.
  2. Only make UPDATE calls to resources when it is necessary.

Related Issue

Proposed Changes

I have tested the PR for most use-cases I can think of, and have ensured that the make test pass, but I can't run make test.e2e locally.

However, there may be subtle behavior changes that I have not considered, so I would appreciate a thorough review.

The following sections cover the specific changes made in this PR.

Changes to caching:

We have a new controller flag --enable-managed-secrets-caching which defaults to true.
Also, we have a new reconcile.external-secrets.io/managed=true label which is added to all target secrets managed by ExternalSecrets.

Previously, we were not caching any Secrets (when --enable-secrets-caching=false, which is the default).
This was causing a lot of unnecessary API calls during every reconcile of each ExternalSecret.
To reduce the number of API calls and still avoid high memory usage on clusters with many secrets, we now have a special cache that only caches secrets with the reconcile.external-secrets.io/managed=true label.

Users can disable this cache by setting --enable-managed-secrets-caching=false.
However, this is not recommended, as it's better to increase the memory of the controller to handle all the secrets you want to manage with ExternalSecrets.

Changes when updating target secrets:

We now only update the target secret if the data has changed.

We use the equality.Semantic.DeepEqual() to ensure we would be making a change before we do it.

This change is critical for systems like Kyverno, as they rely on admission events to trigger restarts of deployments when secrets are updated. For example, previously, if an ExternalSecret did not set spect.target.template.type, it would do an update call every reconcile because it would try and unset the Opaque default value.

Changes when spec.creationPolicy=Owner:

Previously, some keys might be retained, if they existed previously on the secret before the ExternalSecret took ownership of it.

Now, the ExternalSecret defines the full "desired state" of the target Secret, and any extraneous data keys will be removed. (WARNING: this might mean your target secret keys are "cleaned up" when updating to this PR).

Changes when data key is removed from target template:

Previously, there were strange situations where a data key being removed from an ExternalSecret target template would not result in the key being removed from the target Secret.

Now, the behavior is that any key which was previously managed by the ExternalSecret will be removed if it no longer in the template or data/dataFrom definitions. (WARNING: this might mean your target secret keys are "cleaned up" when updating to this PR)

Changes for immutable secrets:

There are two ways a secret could become immutable:

  1. The user tells ExternalSecrets to make the target immutable by setting spec.target.template.immutable=true.
  2. The target secret already existed and had the immutable field set to true.

We now correctly handle both cases, and we will not attempt to update the target secret's data, if it's immutable.

If the controller updated the data of an immutable secret, the ExternalSecret will put into an error status, with a message explaining why. However, we will still update the metadata of the target secret if it changes (for our tracking purposes).

Changes when two ExternalSecrets try to claim creationPolicy=Owner of the same target

Previously, we would loop forever in the case that another ExternalSecret (or some other object) was already the owner of the target Secret (because we returned an error on the reconcile loop).

Now we have a nice error condition in the status of the ExternalSecret, and just wait for the target secret to be updated again to see if it's now not owned.

Changes when spec.refreshInterval=0:

We now correctly handle the case where spec.refreshInterval=0 so it matches our documentation:

  • We will only create the secret once, and we will not update it, even if the spec.target.template changes in the ExternalSecret.
  • If the target secret is deleted, we will recreate it with the latest data/template.
  • If the target secret data is manually updated, we will revert it back to the latest data/template from the ExternalSecret.

Changes when spec.creationPolicy=Merge and missing target secret:

We have introduced a new SecretMissing reason which we use to indicate that the target secret does not exist.
When an ExternalSecret is in this state, we will check again every refreshInterval to see if the target secret has been created.

Changes when spec.creationPolicy is changed away from Owner:

Previously, we would not clean up the metadata.ownerReferences on the target secret if the ExternalSecret had its creation policy changed away from Owner, this led to the possibility of the target secret being deleted when the ExternalSecret was deleted. We also did not clean up the reconcile.external-secrets.io/created-by label.

We now remove both of these things when the creation policy is changed away from Owner.

Changes to requeue logic:

We now correctly calculate how long we should wait before the next reconcile.

This was inconsistent previously, and sometimes we would wait way longer than refreshInterval between reconciles.
Now we always use our newly added getRequeueResult() to calculate the next reconcile time.

No longer use PATCH for updates:

We now always use Update() calls rather than Patch() to update target secrets.

The Patch function does not check if the resource has changed before updating it so can be problematic if multiple controllers are updating the same resource. If we get an Update conflict, we requeue immediately.

Removals

Future work

  • We should also cache secrets that are used in SecretStore provider credentials.
    Currently, these are still retrieved every loop, and are not currently cached when --enable-secrets-caching=false.

  • We should apply the changes to how we patch status (using the deferred function) in the other controllers
    As they also do unnecessary updates and use Patch() calls rather than Update() calls.

Checklist

  • I have read the contribution guidelines
  • All commits are signed with git commit --signoff
  • My changes have reasonable test coverage
  • All tests pass with make test
  • I ensured my PR is ready for review with make reviewable

Signed-off-by: Mathew Wicks <5735406+thesuperzapper@users.noreply.github.com>
@thesuperzapper thesuperzapper requested a review from a team as a code owner November 9, 2024 23:27
@thesuperzapper
Copy link
Copy Markdown
Contributor Author

@Skarlso or @moolen can you please give this an /ok-to-test, so we can run the e2e tests on it?

This is a very important improvement to the way reconciliation works but I was not able to run the e2e tests locally.

Happy to answer any questions as I know this is a big PR.

@Skarlso
Copy link
Copy Markdown
Contributor

Skarlso commented Nov 10, 2024

/ok-to-test sha=facdfc036c93dfa5bc7fd3041d0353dbad385bee

@Skarlso
Copy link
Copy Markdown
Contributor

Skarlso commented Nov 10, 2024

I'm interested in reviewing this, however, I'll be out next week on Kubecon so I can't promise anything.

@eso-service-account-app
Copy link
Copy Markdown
Contributor

@thesuperzapper
Copy link
Copy Markdown
Contributor Author

thesuperzapper commented Nov 12, 2024

For anyone reviewing, there is a discussion here #4099 (comment) about if we should change the behavior of ExternalSecrets with refreshInterval=0, so that we don't revert re-sync if the data in the target secret is modified.

Currently, when refreshInterval=0, a sync only happens under the following conditions:

  1. The ExternalSecret has never been synced before
  2. The target Secret does not exist (perhaps because it was deleted)
  3. The target secret's data was changed outside the ExternalSecret's control:
    • Detected by the reconcile.external-secrets.io/data-hash annotation not being correct.

If we want to change this, we can add another check in isSecretValid() so the data hash only applies if refreshInterva>0 (but I am still on the fence, see the discussion in the other issue).

@Skarlso
Copy link
Copy Markdown
Contributor

Skarlso commented Nov 16, 2024

Search for if errors.Is(err, esv1beta1.NotModifiedErr) { in the code to find what I removed.

Oh f*ck that should have been inside the above error. How did I miss that? :D :D Thanks.

@Skarlso
Copy link
Copy Markdown
Contributor

Skarlso commented Nov 16, 2024

Thanks for the awesome PR description. That helps a lot to have the right context.

Copy link
Copy Markdown
Contributor

@Skarlso Skarlso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did my first review round with questions and some code improvement requests.

Well done, nice work so far!

Signed-off-by: Mathew Wicks <5735406+thesuperzapper@users.noreply.github.com>
@Skarlso Skarlso added this to the 0.11.0 milestone Nov 20, 2024
Signed-off-by: Mathew Wicks <5735406+thesuperzapper@users.noreply.github.com>
Signed-off-by: Mathew Wicks <5735406+thesuperzapper@users.noreply.github.com>
@thesuperzapper
Copy link
Copy Markdown
Contributor Author

@Skarlso I was chatting with @gusfcarvalho and I realized there was a longstanding bug I could easily fix.

The bug was when the creationPolicy was changed away from Owner, we would not remove the metdata.ownerReferences or reconcile.external-secrets.io/created-by label, this meant the target secret might be deleted despite having a creationPolicy of None, Orphan, or Owner.

See 177b0c6 for the change.

@Skarlso
Copy link
Copy Markdown
Contributor

Skarlso commented Nov 21, 2024

Nice! Thanks for the improvements.

Signed-off-by: Mathew Wicks <5735406+thesuperzapper@users.noreply.github.com>
Signed-off-by: Mathew Wicks <5735406+thesuperzapper@users.noreply.github.com>
@thesuperzapper
Copy link
Copy Markdown
Contributor Author

@Skarlso I have made another important fix in c3cd973 to prevent endless loops that consume lots of API calls needlessly for "permeant errors" which retrying won't fix:

  • Now, when two ExternalSecrets try and claim ownership of the same target secret, we dont bother retrying (which we did forever previously because we returned an error on the reconcile).
  • Now, we explicitly detect the sate where one ExternalSecret is creationPolicy Owner and the other is Merge, and put the merge one into an error state explaining this.

Signed-off-by: Mathew Wicks <5735406+thesuperzapper@users.noreply.github.com>
Signed-off-by: Mathew Wicks <5735406+thesuperzapper@users.noreply.github.com>
@Skarlso
Copy link
Copy Markdown
Contributor

Skarlso commented Nov 24, 2024

@thesuperzapper Where are you now with it? I would say, don't do any more fixes. :D Because this is awesome as is and we really would like to get this merged. :)

@Skarlso
Copy link
Copy Markdown
Contributor

Skarlso commented Nov 24, 2024

How the heck did it get worse? :D

Refactor this method to reduce its Cognitive Complexity from 105 to the 15 allowed.

We did so many refactors. :D

@Skarlso
Copy link
Copy Markdown
Contributor

Skarlso commented Nov 24, 2024

/ok-to-test sha=4c387ccd01146cfc9fa4e623bec70d4136b5e961

@eso-service-account-app
Copy link
Copy Markdown
Contributor

@sonarqubecloud
Copy link
Copy Markdown

@Skarlso
Copy link
Copy Markdown
Contributor

Skarlso commented Nov 24, 2024

Thank you very much for your contribution! This is a significant increase in performance of ESO and a much needed one. :)

Also, reviewing this has been my pleasure that I could learn some more interesting things about the various caching mechanisms of the controller runtime.

@Skarlso Skarlso merged commit ac26166 into external-secrets:main Nov 24, 2024
@thesuperzapper thesuperzapper deleted the improve-external-secrets-controller branch November 24, 2024 22:24
dtejadav pushed a commit to dtejadav/external-secrets that referenced this pull request Nov 25, 2024
…he (external-secrets#4086)

* feat: reduce api calls and introduce partial secret cache

Signed-off-by: Mathew Wicks <5735406+thesuperzapper@users.noreply.github.com>

* updates from review 1

Signed-off-by: Mathew Wicks <5735406+thesuperzapper@users.noreply.github.com>

* updates from review 2

Signed-off-by: Mathew Wicks <5735406+thesuperzapper@users.noreply.github.com>

* fix updating CreationPolicy after secret creation

Signed-off-by: Mathew Wicks <5735406+thesuperzapper@users.noreply.github.com>

* updates from review 3

Signed-off-by: Mathew Wicks <5735406+thesuperzapper@users.noreply.github.com>

* prevent loop when two ES claim Owner on the same target secret

Signed-off-by: Mathew Wicks <5735406+thesuperzapper@users.noreply.github.com>

* updates from review 4

Signed-off-by: Mathew Wicks <5735406+thesuperzapper@users.noreply.github.com>

* fix ClusterSecretStore not ready message

Signed-off-by: Mathew Wicks <5735406+thesuperzapper@users.noreply.github.com>

---------

Signed-off-by: Mathew Wicks <5735406+thesuperzapper@users.noreply.github.com>
Co-authored-by: Gergely Brautigam <182850+Skarlso@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

3 participants