feat(appliance): roll frontend deployment when pgsql secret changes#63845
Conversation
| ifChanged := struct { | ||
| config.FrontendSpec | ||
| PG *config.DatabaseConnectionSpec `json:"pg,omitempty"` | ||
| }{ | ||
| FrontendSpec: cfg, | ||
| PG: dbConnSpec, | ||
| } | ||
|
|
||
| return reconcileObject(ctx, r, ifChanged, &dep, &appsv1.Deployment{}, sg, owner) |
There was a problem hiding this comment.
Example of how individual resources in individual services can add data to the "ifChanged" parameter of reconcileObject. Usually, just the config element itself is passed. Please follow the calls in https://github.com/sourcegraph/sourcegraph/blob/main/internal/appliance/reconciler/kubernetes.go through for more context - reconcileObject() throws the version into the mix as well. So in general we update kubernetes resources when the service config or the version has changed, and now we see a facility to throw more arbitrary data into there too.
| @@ -0,0 +1,561 @@ | |||
| resources: | |||
There was a problem hiding this comment.
We can see the config hash annotation changing compared to default.yaml, which would cause a rolling deploy.
❯ diff internal/appliance/reconciler/testdata/golden-fixtures/frontend/default.yaml internal/appliance/reconciler/testdata/golden-fixtures/frontend/after-create-pg-secret.yaml
6c6
< appliance.sourcegraph.com/configHash: b5dce290e22d1afb4c9102ac4c245490ab01dd3be13653de391536cfe0e323b0
---
> appliance.sourcegraph.com/configHash: e32d95c60ed3e814ce53f11507d1c894ae3b417b653a9f307379b9252d6d5785
8c8
< generation: 1
---
> generation: 2
There was a problem hiding this comment.
Dang, I just realised, this won't cause a rolling deploy! We change the configHash annotation on the deployment object, which causes our reconciler's createOrUpdateObject() to update the object, but without updating Deployment.spec.template no rolling deploy will actually occur! That was really dumb of me. Will fix.
| // We have to make a config change to trigger the reconcile loop | ||
| suite.awaitReconciliation(namespace, func() { | ||
| cfgMap := suite.newConfigMap(namespace, "frontend/default") | ||
| cfgMap.GetAnnotations()["force-reconcile"] = "1" |
There was a problem hiding this comment.
This might make a good appliance web app feature - a "force re-reconcile" button. It could be implemented the same way this test does, incrementing a number on a configmap annotation, or just putting something random in there.
There was a problem hiding this comment.
I ended up having to do something similar (but not quite as elegant) in the POC with just forcing a reconcile every 5 seconds or so no matter what. Not ideal though, and a button to trigger it is likely a far better idea 😄 .
There was a problem hiding this comment.
I haven't checked what the PoC does, but controller-runtime actually lets you do something like that. Maybe we can make the duration configurable. e.g. at the end of Reconcile() in reconcile.go:
return ctrl.Result{
RequeueAfter: time.Hour,
}, nilSeparate issue I think, if we want to do this
e8ef1bb to
75897da
Compare
Ensure configHash changes by adding connection secret to the ifChanged object, which causes createOrUpdateObject() in kubernetes.go to update the object. Set the "checksum/auth" annotation on the pod template in frontend's deployment to a hash of this connection object, to cause a rolling deployment. This annotation exists in helm, but this mechanism is slightly more general as it always hashes secret values, whereas helm will sometimes be solely hashing the name of a secret whose values may have changed.
75897da to
4912b99
Compare
| if err != nil { | ||
| return err | ||
| } | ||
| template.Template.ObjectMeta.Annotations["checksum/auth"] = dbConnHash |
There was a problem hiding this comment.
this is what actually causes the rolling deploy, after the appliance updates the deployment
|
@jdpleiness this is ready for review again now, sorry for the churn |
…es (#63954) Builds on https://github.com/sourcegraph/sourcegraph/pull/63845, and closes https://linear.app/sourcegraph/issue/REL-14/ensure-pods-roll-when-referenced-secret-changes. When one of the 3 DB configs changes, an annotation "checksum/auth" on frontend's deployment's spec.template.metadata should change on next reconcile. This will cause pods to roll, picking up the new secret values. It should also cause the top-level annotation configHash to change, which indicates to the appliance that the kubernetes resource should be updated. A similar mechanism is implemented for "checksum/redis", on every service that uses redis (obtained by grepping the helm chart for the same annotation).
…63845) This approach is a bit of a departure from how our helm chart works. That hashes hashes values blocks to determine these annotations. The issue is, that these values blocks often contain references to secrets, and therefore don't change when the _content_ of the secret changes. This PR takes a more general approach of looking for that secret, which must always eventually exist - whether admin-provisioned or provisioned by the pgsql service definition - and hashes that.
…es (#63954) Builds on https://github.com/sourcegraph/sourcegraph/pull/63845, and closes https://linear.app/sourcegraph/issue/REL-14/ensure-pods-roll-when-referenced-secret-changes. When one of the 3 DB configs changes, an annotation "checksum/auth" on frontend's deployment's spec.template.metadata should change on next reconcile. This will cause pods to roll, picking up the new secret values. It should also cause the top-level annotation configHash to change, which indicates to the appliance that the kubernetes resource should be updated. A similar mechanism is implemented for "checksum/redis", on every service that uses redis (obtained by grepping the helm chart for the same annotation).
Idea for https://linear.app/sourcegraph/issue/REL-14/ensure-pods-roll-when-referenced-secret-changes, does not close it though. Please see that issue for more context and alternative approaches considered.
This PR is designed to illustrate the idea more clearly than typing a description of it - do feel free to push back, just because it's already PR'ed doesn't mean it's "too late!". If the general approach is 👍, I can make similar PRs for the other services and their dependencies (basically everything that depends on codeintel/codeinsights DBs and Redis) to close this.
This approach is a bit of a departure from how our helm chart works. That hashes hashes values blocks to determine these annotations. The issue is, that these values blocks often contain references to secrets, and therefore don't change when the content of the secret changes. Example:
https://github.com/sourcegraph/deploy-sourcegraph-helm/blob/84ea4b213d8c08390f3b3335c3f226016a0408b5/charts/sourcegraph/templates/frontend/sourcegraph-frontend.Deployment.yaml#L32 - this annotation, when changed, causes pods to roll on deployment.
https://github.com/sourcegraph/deploy-sourcegraph-helm/blob/84ea4b213d8c08390f3b3335c3f226016a0408b5/charts/sourcegraph/templates/_helpers.tpl#L259-L264 - the template helper hashes some values blocks
https://github.com/sourcegraph/deploy-sourcegraph-helm/blob/84ea4b213d8c08390f3b3335c3f226016a0408b5/charts/sourcegraph/values.yaml#L722-L736 - if the auth block references an existing secret instead of being set directly, this won't change when the secret content changes.
This PR takes a more general approach of looking for that secret, which must always eventually exist - whether admin-provisioned or provisioned by the pgsql service definition - and hashes that.
Draft because I'm looking for approach review,
also because this depends on an unmerged branch. This can be reviewed independently, but that bug fix (https://github.com/sourcegraph/sourcegraph/pull/63826) is also necessary for this to actually work, otherwise the admin-provisioned secret is blown away by the reconcile loop.Test plan
envtest tests included.
Changelog