ExternalSecret reconciliation silently stops with 2.0.1 #6053

@dnwlf

Describe the bug
After upgrading to the External Secrets Operator (ESO) 2.0.1 Helm chart, clusters with a large number of ESO resources silently stop reconciling them.

One cluster where we observed this issue has the following ESO resource counts:

  • ExternalSecret: ~3000
  • SecretStore: ~140
  • ClusterExternalSecrets: 2
  • ClusterSecretStores: 2

We are unable to find any errors or warnings in the ESO controller logs that would indicate a problem; the .status.refreshTime for ESO resources simply stops updating.

We can tell that reconciliation is not completely broken. Restarting the ESO controller Deployment in a cluster does result in at least a single reconciliation of the cluster's ESO resources (based on .status.refreshTime), but reconciliation silently stops shortly after Deployment restart.
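For anyone trying to confirm the same symptom, here is a minimal sketch of the staleness check described above. It assumes the input is the parsed output of `kubectl get externalsecrets --all-namespaces -o json` and that `.status.refreshTime` uses the usual Kubernetes `YYYY-MM-DDTHH:MM:SSZ` format. The function names and the `max_age` threshold are hypothetical; choose a threshold comfortably larger than your own refresh interval.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(ts: str) -> datetime:
    """Parse a Kubernetes timestamp such as '2025-01-01T12:00:00Z' as UTC."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stale_external_secrets(doc: dict, now: datetime, max_age: timedelta) -> list[str]:
    """Return sorted 'namespace/name' keys whose refreshTime is missing or older than max_age.

    `doc` is assumed to be a parsed Kubernetes List object (a dict with "items").
    """
    stale = []
    for item in doc.get("items", []):
        meta = item.get("metadata", {})
        key = f"{meta.get('namespace', '')}/{meta.get('name', '')}"
        refresh = item.get("status", {}).get("refreshTime")
        # A resource that was never refreshed is treated as stale too.
        if refresh is None or now - parse_k8s_time(refresh) > max_age:
            stale.append(key)
    return sorted(stale)
```

To use it, load the kubectl JSON with `json.load` and pass `datetime.now(timezone.utc)` as `now`. A list that keeps growing after a controller restart would match the behavior reported here.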

The CPU and memory utilization of the ESO controller and webhook pods was well under the configured requests/limits in clusters where we observed this issue.

We do not see this issue in clusters that have a smaller number of ESO resources deployed.

After reverting to the ESO 2.0.0 Helm chart with no other cluster changes, reconciliation is once again working as expected in all clusters.

To Reproduce
Steps to reproduce the behavior:

  1. provide all relevant manifests
    # example values.yaml with the relevant ESO configuration for a cluster where we observed this issue
    external-secrets:
      replicaCount: 3
      concurrent: 50
      extraArgs:
        experimental-enable-vault-token-cache: true
        client-qps: 100
        client-burst: 200
  2. provide the Kubernetes and ESO version
    1. GKE Version: 1.33.5
    2. ESO Version: 2.0.1 (via external-secrets helm chart)

Expected behavior
Reconciliation should not silently stop.

Screenshots
N/A

Additional context
If there is additional information that would be helpful to provide, please let me know and I can gather it. Unfortunately, at the time this issue occurred we weren't collecting all of the ESO metrics, which would have made this easier to troubleshoot and diagnose.

Metadata

Assignees

No one assigned

    Labels

    kind/bug: Categorizes issue or PR as related to a bug.
    triage/pending-triage: This issue was not triaged.

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
