Skip to content

fix(oracle): prevent TCP connections leakage#5491

Merged
moolen merged 8 commits intoexternal-secrets:mainfrom
adutchak:adutchak/oracle-fix-tcp-connections-leakage
Oct 23, 2025
Merged

fix(oracle): prevent TCP connections leakage#5491
moolen merged 8 commits intoexternal-secrets:mainfrom
adutchak:adutchak/oracle-fix-tcp-connections-leakage

Conversation

@adutchak
Copy link
Copy Markdown
Contributor

Problem Statement

Prevents creating new TCP connection for the each secret reconciliation in Oracle OCI\OKE environment when using Workload principal type of authentication.

Related Issue

Fixes #5490

Proposed Changes

How do you like to solve the issue and why?

It turned out that each call to auth.OkeWorkloadIdentityConfigurationProviderWithServiceAccountTokenProvider() (invoked by NewClient()) opens a new TCP connection in the OKE environment and never closes it. The SDK also does not expose any way to close the connection, making it impossible to close it later.

The issue is resolved by saving an auth.ConfigurationProviderWithClaimAccess instance for each ClusterSecretStore in the VaultManagementService structure, so that whenever a new call to the NewClient() function occurs upstream, we either create the configuration provider if it’s missing or return the previously saved one.

Format

Please ensure that your PR follows the following format for the title:

feat(scope): add new feature
fix(scope): fix bug
docs(scope): update documentation
chore(scope): update build tool or dependencies
ref(scope): refactor code
clean(scope): provider cleanup
test(scope): add tests
perf(scope): improve performance
desig(scope): improve design

Where scope is optionally one of:

  • charts
  • release
  • testing
  • security
  • templating

Checklist

  • I have read the contribution guidelines
  • All commits are signed with git commit --signoff
  • [] My changes have reasonable test coverage
  • All tests pass with make test
  • I ensured my PR is ready for review with make reviewable

@github-actions github-actions bot added area/oracle Issues / Pull Requests related to oracle provider kind/bug Categorizes issue or PR as related to a bug. size/s labels Oct 21, 2025
Signed-off-by: Anatoliy Dutchak <anatoliy.dutchak@avalabs.org>
@adutchak adutchak force-pushed the adutchak/oracle-fix-tcp-connections-leakage branch from 8382ccd to d4a3707 Compare October 21, 2025 17:29
Signed-off-by: Anatoliy Dutchak <anatoliy.dutchak@avalabs.org>
Signed-off-by: Anatoliy Dutchak <anatoliy.dutchak@avalabs.org>
Comment on lines +618 to +627
// Caching by store name as a key is sufficient, as each VaultManagementService instance is tied to a single store.
_, ok := vms.authConfigurationsCache[store.GetName()]
if !ok {
vms.authConfigurationsCache[store.GetName()], err = auth.OkeWorkloadIdentityConfigurationProviderWithServiceAccountTokenProvider(tokenProvider)
if err != nil {
return nil, err
}
}

return vms.authConfigurationsCache[store.GetName()], nil
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be wrong, but if a SecretStore /ClusterSecretStore changes configuration (e.g. due to user typo), this would never be reflected and updated in this cache, right?

Copy link
Copy Markdown
Member

@gusfcarvalho gusfcarvalho Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if SecretStores/ClusterSecretStores have the same name, you'll have even a potentially worse situation 🤔 (cache conflicts)

Copy link
Copy Markdown
Contributor Author

@adutchak adutchak Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, the store is esv1.GenericStore. I was focused only on ClusterSecretStore which is cluster-level object. I'll use a different key for that, probably UID.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used UID as caching key instead. Thank you for flagging.

Copy link
Copy Markdown
Contributor Author

@adutchak adutchak Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I double checked and realized that UID's are not changed when users make any changes to the manifests. ResourceVersion - does. So caching by store.GetResourceVersion() key is safer and addresses the first question you raised.

Changed the key to store.GetResourceVersion() still.

Signed-off-by: Anatoliy Dutchak <anatoliy.dutchak@avalabs.org>
@adutchak adutchak force-pushed the adutchak/oracle-fix-tcp-connections-leakage branch from 97662f7 to 9fdc926 Compare October 21, 2025 19:49
Signed-off-by: Anatoliy Dutchak <anatoliy.dutchak@avalabs.org>
Signed-off-by: Anatoliy Dutchak <anatoliy.dutchak@avalabs.org>
@anders-swanson
Copy link
Copy Markdown
Contributor

The way this bug is described, it seems like a bug in the OCI Go SDK where the client is not properly cleaned up by GC. @adutchak have you raised a similar issue on that project?

compartment string
encryptionKey string
workloadIdentityMutex sync.Mutex
authConfigurationsCache map[string]auth.ConfigurationProviderWithClaimAccess
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears this map will grow in size over the lifetime of the operator. Perhaps not enough to make a difference, but "old" store resource versions will accumulate clients over time.

Copy link
Copy Markdown
Contributor Author

@adutchak adutchak Oct 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the map size will be (SecretStores_N * ResourceVersions_N) + (ClusterSecretStores_N * ResourceVersions_N).
I can implement a cache reset as soon as the map’s key-value count reaches, let’s say, 50 items.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

9aafe56
50 is still some sort of magic number for me, but it sets at least some boundaries. For more bigger environments, it won't let it over-consuming the memory.

@adutchak
Copy link
Copy Markdown
Contributor Author

The way this bug is described, it seems like a bug in the OCI Go SDK where the client is not properly cleaned up by GC. @adutchak have you raised a similar issue on that project?

No, but this is good point. I will open an issue at the SDK repository too.

@adutchak
Copy link
Copy Markdown
Contributor Author

The way this bug is described, it seems like a bug in the OCI Go SDK where the client is not properly cleaned up by GC. @adutchak have you raised a similar issue on that project?

No, but this is good point. I will open an issue at the SDK repository too.

oracle/oci-go-sdk#606

Signed-off-by: Anatoliy Dutchak <anatoliy.dutchak@avalabs.org>
Copy link
Copy Markdown
Contributor

@anders-swanson anders-swanson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@moolen moolen enabled auto-merge (squash) October 23, 2025 05:59
@moolen moolen merged commit d13866d into external-secrets:main Oct 23, 2025
5 checks passed
@sonarqubecloud
Copy link
Copy Markdown

SamuelMolling pushed a commit to SamuelMolling/external-secrets that referenced this pull request Oct 24, 2025
* fix(oracle): cache ServiceAccountTokenProvider per clustersecretstore

Signed-off-by: Anatoliy Dutchak <anatoliy.dutchak@avalabs.org>

* rename OKETokenProvider property

Signed-off-by: Anatoliy Dutchak <anatoliy.dutchak@avalabs.org>

* rename tokenProviders property

Signed-off-by: Anatoliy Dutchak <anatoliy.dutchak@avalabs.org>

* using UID for caching key

Signed-off-by: Anatoliy Dutchak <anatoliy.dutchak@avalabs.org>

* Using GetResourceVersion() as caching key

Signed-off-by: Anatoliy Dutchak <anatoliy.dutchak@avalabs.org>

* Removing unnecessary string conversion

Signed-off-by: Anatoliy Dutchak <anatoliy.dutchak@avalabs.org>

* Introducing cache reset with authConfigurationsCachePoolSize

Signed-off-by: Anatoliy Dutchak <anatoliy.dutchak@avalabs.org>

---------

Signed-off-by: Anatoliy Dutchak <anatoliy.dutchak@avalabs.org>
Signed-off-by: Samuel Molling <samuelmolling@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/oracle Issues / Pull Requests related to oracle provider kind/bug Categorizes issue or PR as related to a bug. size/s

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Oracle] clustersecretstores using Workload principals are leaking TCP connections

5 participants