Fix: let operator self-create webhook TLS secret to avoid GitOps overwrites #454
Conversation
…writes The Helm chart renders an empty webhook TLS secret that overwrites populated certificates on every `helm template | kubectl apply` or GitOps sync (ArgoCD, FluxCD). This breaks webhook TLS until cert-controller repopulates the secret. Add a `webhookServerSecret.enabled` Helm value (default true for backward compatibility) so users can disable the Helm-rendered secret. When disabled, the operator creates the secret itself via a new `ensureSecretExists()` function, which handles HA race conditions and preserves existing certificate data. The deployment volume mount is set to `optional: true` so the pod can start before the secret exists. Fixes ai-dynamo#453 Signed-off-by: Geoff Flarity <gflarity@nvidia.com>
renormalize
left a comment
There was a problem hiding this comment.
Have we considered helm's lookup function? For users running helm commands this should do the trick in my opinion.
I don't think we should start going about creating resources needed for the operator to run within the operator itself. I don't know if doing so sticks to the conventions set within the community.
If users have other ways of deploying the operator other than the officially supported helm method, then they should adapt their deployment methods.
tldr; We can't use helm lookup. It doesn't work with helm template, and hence popular CD tools like ArgoCD and FluxCD. Unfortunately, Helm's lookup function doesn't work with helm template — it returns nil/empty when there's no cluster connection (this is documented behavior). Since both ArgoCD and FluxCD use helm template for manifest generation (ArgoCD #5202 — open since 2021 with 245+ upvotes, still unresolved), lookup can't help for the exact The lookup approach would only work with direct helm install/upgrade, but that path doesn't have this bug anyway (Helm's 3-way merge preserves live secret data). Also, it's worth noting that despite being a bit of standard, at scale Helm is often used for it's basic template population ability only. So I'd recommend we support more than just Helm installations as first class (Kustomize etc). Regarding the operator self-creating the secret: the existing OPA cert-controller library already manages (updates) this secret — it just can't create it. This is a gap, but the ensureSecretExists() call is a minimal gap-fill (create-if-missing) to make the upstream library work without depending on Helm to pre-create an empty placeholder. This pattern is common in operators that manage their own TLS. In my view, the lack of automation in k8s for provisioning webhook certificates itself is a bit of a gap. It's unfortunate we need OPA cert-controller or other tooling like cert-manager. But it's practical and OPA cert-controller is useful (why OPA made it ;). |
|
@gflarity thanks for the explanation! Yeah, that makes sense. I guess this is the only way we can solve the bug we're seeing. It's a shame the tooling works the way it does, forcing us to include this in the operator. |
What type of PR is this?
/kind bug
What this PR does / why we need it:
The Helm chart template renders an empty
grove-webhook-server-certSecret whencertProvisionModeisauto(the default). On subsequent deploys viahelm template | kubectl applyor GitOps tools (ArgoCD, FluxCD), the empty values overwrite the populated certificates, breaking webhook TLS until cert-controller repopulates the secret.This PR fixes the issue with three coordinated changes:
New
webhookServerSecret.enabledHelm value (defaulttruefor backward compatibility) — allows users deploying viahelm template | kubectl applyor GitOps to set it tofalse, preventing the empty secret from overwriting auto-generated certs on every apply.Operator self-creates the secret if missing — a new
ensureSecretExists()function creates the TLS secret before handing off to cert-controller, which can only Update existing secrets (not Create them). This fills the gap when the Helm-rendered secret is disabled. Handles HA race conditions viaAlreadyExistscheck and preserves existing certificate data.optional: trueon the deployment volume mount — allows the pod to start before the secret exists, since the operator gates webhook serving oncertsReadyCh.Which issue(s) this PR fixes:
Fixes #453
Special notes for your reviewer:
webhookServerSecret.enabled=truepreserves full backward compatibility forhelm upgradeusers (Helm 3's three-way merge already handles this correctly).ensureSecretExistsfunction uses a direct (non-cached) client because it runs during manager setup before informer caches start.createRBAC permission on secrets via its ClusterRole.Does this PR introduce a API change?
Additional documentation e.g., enhancement proposals, usage docs, etc.: