Describe the bug:
Since upgrading to 1.15.0, CAInjector has been entering a crash loop on some of our EKS clusters. The actual error appears to be a timeout while listing resources, but from the EKS side we cannot see any long-running or failed calls to the control plane that match it. It also appears to use more memory since the upgrade: the pod has been OOMing a lot, though that is a secondary issue.
cert-manager-cainjector-dfd4bd499-p2x48.log
Expected behaviour:
The pod should not crash.
Steps to reproduce the bug:
- EKS 1.30 running bottlerocket
- Install cert manager 1.15.0
We are not sure exactly, as it is inconsistent for us. 2 of our 12 clusters are currently affected, and the only notable thing about them is that they are our two largest, so our best guess is that it is load related. Here are stats for the smaller of the two in case they help (feel free to request others; I'm just listing what I have seen matter on other issues):
- 70 m5d.8xlarge nodes
- 2558 running pods
- 151 CRDs
- 19952 secrets
Anything else we need to know?:
Environment details:
- Kubernetes version: 1.30
- Cloud-provider/provisioner: AWS EKS 1.30
- cert-manager version: 1.15.0
- Install method: helm
/kind bug