Skip to content

Fix: let operator self-create webhook TLS secret to avoid GitOps overwrites #454

Merged
gflarity merged 2 commits into
ai-dynamo:mainfrom
gflarity:support_helm_template_tooling
Mar 2, 2026
Merged

Fix: let operator self-create webhook TLS secret to avoid GitOps overwrites #454
gflarity merged 2 commits into
ai-dynamo:mainfrom
gflarity:support_helm_template_tooling

Conversation

@gflarity

@gflarity gflarity commented Feb 25, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

The Helm chart template renders an empty grove-webhook-server-cert Secret when certProvisionMode is auto (the default). On subsequent deploys via helm template | kubectl apply or GitOps tools (ArgoCD, FluxCD), the empty values overwrite the populated certificates, breaking webhook TLS until cert-controller repopulates the secret.

This PR fixes the issue with three coordinated changes:

  1. New webhookServerSecret.enabled Helm value (default true for backward compatibility) — allows users deploying via helm template | kubectl apply or GitOps to set it to false, preventing the empty secret from overwriting auto-generated certs on every apply.

  2. Operator self-creates the secret if missing — a new ensureSecretExists() function creates the TLS secret before handing off to cert-controller, which can only Update existing secrets (not Create them). This fills the gap when the Helm-rendered secret is disabled. Handles HA race conditions via AlreadyExists check and preserves existing certificate data.

  3. optional: true on the deployment volume mount — allows the pod to start before the secret exists, since the operator gates webhook serving on certsReadyCh.

Which issue(s) this PR fixes:

Fixes #453

Special notes for your reviewer:

  • The default webhookServerSecret.enabled=true preserves full backward compatibility for helm upgrade users (Helm 3's three-way merge already handles this correctly).
  • The ensureSecretExists function uses a direct (non-cached) client because it runs during manager setup before informer caches start.
  • The operator already has create RBAC permission on secrets via its ClusterRole.
  • All existing e2e tests pass (CM, GS, RU, SO, TAS suites).

Does this PR introduce a API change?

Add `webhookServerSecret.enabled` Helm value to control whether the chart renders the empty webhook TLS Secret. Set to `false` when deploying via `helm template | kubectl apply` or GitOps tools (ArgoCD, FluxCD) to prevent certificate overwrites. The operator will create the secret itself if missing.

Additional documentation e.g., enhancement proposals, usage docs, etc.:

Users deploying via `helm template | kubectl apply` or GitOps tools should set `webhookServerSecret.enabled=false` in their values. The operator will automatically create and manage the webhook TLS secret. Standard `helm upgrade` users do not need to change anything.

…writes

The Helm chart renders an empty webhook TLS secret that overwrites
populated certificates on every `helm template | kubectl apply` or
GitOps sync (ArgoCD, FluxCD). This breaks webhook TLS until
cert-controller repopulates the secret.

Add a `webhookServerSecret.enabled` Helm value (default true for
backward compatibility) so users can disable the Helm-rendered secret.
When disabled, the operator creates the secret itself via a new
`ensureSecretExists()` function, which handles HA race conditions
and preserves existing certificate data. The deployment volume mount
is set to `optional: true` so the pod can start before the secret
exists.

Fixes ai-dynamo#453

Signed-off-by: Geoff Flarity <gflarity@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Feb 25, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@gflarity gflarity changed the title fix: let operator self-create webhook TLS secret to avoid GitOps overwrites Fixes 453: let operator self-create webhook TLS secret to avoid GitOps overwrites Feb 25, 2026
@gflarity gflarity changed the title Fixes 453: let operator self-create webhook TLS secret to avoid GitOps overwrites Fix: let operator self-create webhook TLS secret to avoid GitOps overwrites Feb 25, 2026
shayasoolin
shayasoolin previously approved these changes Feb 26, 2026

@renormalize renormalize left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we considered helm's lookup function? For users running helm commands this should do the trick in my opinion.

I don't think we should start going about creating resources needed for the operator to run within the operator itself. I don't know if doing so sticks to the conventions set within the community.

If users have other ways of deploying the operator other than the officially supported helm method, then they should adapt their deployment methods.

@gflarity

gflarity commented Feb 26, 2026

Copy link
Copy Markdown
Contributor Author

Have we considered helm's lookup function? For users running helm commands this should do the trick in my opinion.

I don't think we should start going about creating resources needed for the operator to run within the operator itself. I don't know if doing so sticks to the conventions set within the community.

If users have other ways of deploying the operator other than the officially supported helm method, then they should adapt their deployment methods.

tldr; We can't use helm lookup. It doesn't work with helm template, and hence popular CD tools like ArgoCD and FluxCD.


Unfortunately, Helm's lookup function doesn't work with helm template — it returns nil/empty when there's no cluster connection (this is documented behavior). Since both ArgoCD and FluxCD use helm template for manifest generation (ArgoCD #5202 — open since 2021 with 245+ upvotes, still unresolved), lookup can't help for the exact
deployment paths this fix targets.

The lookup approach would only work with direct helm install/upgrade, but that path doesn't have this bug anyway (Helm's 3-way merge preserves live secret data). Also, it's worth noting that despite being a bit of standard, at scale Helm is often used for it's basic template population ability only. So I'd recommend we support more than just Helm installations as first class (Kustomize etc).

Regarding the operator self-creating the secret: the existing OPA cert-controller library already manages (updates) this secret — it just can't create it. This is a gap, but the ensureSecretExists() call is a minimal gap-fill (create-if-missing) to make the upstream library work without depending on Helm to pre-create an empty placeholder. This pattern is common in operators that manage their own TLS.

In my view, the lack of automation in k8s for provisioning webhook certificates itself is a bit of a gap. It's unfortunate we need OPA cert-controller or other tooling like cert-manager. But it's practical and OPA cert-controller is useful (why OPA made it ;).

Comment thread operator/internal/controller/cert/cert.go Outdated
Comment thread operator/internal/controller/cert/cert.go Outdated
Comment thread operator/internal/controller/cert/cert.go Outdated
@renormalize

Copy link
Copy Markdown
Contributor

@gflarity thanks for the explanation!

Yeah, that makes sense. I guess this is the only way we can solve the bug we're seeing. It's a shame the tooling works the way it does, forcing us to include this in the operator.

@gflarity gflarity merged commit 2639186 into ai-dynamo:main Mar 2, 2026
20 of 21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Helm chart renders empty webhook TLS secret that overwrites populated certs on every apply

4 participants