feat(appliance): healthchecker manages ingress-facing service#64043
Conversation
There was a problem hiding this comment.
Note: cc @jdpleiness. Please see commit message for more context. WDYT of this co-ordination mechanism?
This also relates to the discussion in https://github.com/sourcegraph/sourcegraph/pull/64021#discussion_r1689609233 (cc @Chickensoupwithrice): if we do end up changing reconcileFrontendService() so that it only selectively patches certain object fields, and therefore never clobber any field not in the json patch, this mechanism alone might be enough for the routing fallback to just work! 🪄
WDYT?
There was a problem hiding this comment.
Very nice! I think we could close this after we transition to the "refresh" state after the "wait-for-admin" state where the admin switches over to the frontend instance.
Something like: "wait-for-admin" -> (admin clicks launch UI button) -> "refresh state" -> close channel and start monitoring frontend and transition to "maintenance state" in the background
There was a problem hiding this comment.
SGTM! Just had another idea: in order to avoid having to handle 2 cases, 1 for transitioning past the admin button and the other for the appliance booting when the configmap is already in a post-install state, we could close this channel in the reconcile loop if the status field demands it.
IIRC booting a controller does fire off reconcile loops for watched resources that already exist, since new watchers are started.
There was a problem hiding this comment.
worth reading this IMO
There was a problem hiding this comment.
Yea cool illustration of using the k8sClient
DaedalusG
left a comment
There was a problem hiding this comment.
This looks good to me, I'll likely reuse bits of the logic for polling the states of pods for the maintenance splash page
There was a problem hiding this comment.
I think this might be a good place for the splash page Im working on:
https://github.com/sourcegraph/sourcegraph/pull/64019

There was a problem hiding this comment.
Will likely use something like this in the maintenance splash page
There was a problem hiding this comment.
I feel like we might want to get this as an interface healthChecker implements? Just because later we'll want to be able to allow the user to select Sourcegraph or Maintenance UI. Not for regular users but for admins. Thats for down the line though.
There was a problem hiding this comment.
We can always expose it as public later 👍
There was a problem hiding this comment.
Yea cool illustration of using the k8sClient
Add a new background goroutine to the appliance. It does nothing until a "begin" channel closes. The idea is that another part of the appliance will close this channel if the configmap state is set to a post-install value (or on startup if this is already the case when an appliance boots). After this barrier is lifted, the healtchecker periodically checks the readiness (using k8s conditions) of each pod returned by the frontend deployment's label selector. If even a single pod is ready, it ensures that the service points to frontend. Otherwise, it waits for a grace period, checks again, and if downtime persists, it points the service to the appliance. This should cover the following cases: - The service is pointed to frontend after the admin clicks "go" after an intial successful install. - The service is pointed to appliance after frontend downtime that exceeds the grace period. - The servie is promptly pointed to frontend after downtime ends.
But we need the appliance to set the status in order to trigger _this_.
4435e91 to
0d772c3
Compare
**Appliance points ingress-facing service to itself by default** Not frontend. **feat(appliance): healthchecker manages ingress-facing service** Add a new background goroutine to the appliance. It does nothing until a "begin" channel closes. The idea is that another part of the appliance will close this channel if the configmap state is set to a post-install value (or on startup if this is already the case when an appliance boots). After this barrier is lifted, the healtchecker periodically checks the readiness (using k8s conditions) of each pod returned by the frontend deployment's label selector. If even a single pod is ready, it ensures that the service points to frontend. Otherwise, it waits for a grace period, checks again, and if downtime persists, it points the service to the appliance. This should cover the following cases: - The service is pointed to frontend after the admin clicks "go" after an initial successful install. - The service is pointed to appliance after frontend downtime that exceeds the grace period. - The service is promptly pointed to frontend after downtime ends.
Relates to https://linear.app/sourcegraph/issue/REL-78/when-sourcegraph-frontend-is-down-a-user-trying-to-access-sourcegraph but does not close it.
Stacked on https://github.com/sourcegraph/sourcegraph/pull/64032 but can be reviewed independently. Commit log:
Appliance points ingress-facing service to itself by default
Not frontend.
feat(appliance): healthchecker manages ingress-facing service
Add a new background goroutine to the appliance. It does nothing until a
"begin" channel closes. The idea is that another part of the appliance
will close this channel if the configmap state is set to a post-install
value (or on startup if this is already the case when an appliance boots).
After this barrier is lifted, the healtchecker periodically checks the
readiness (using k8s conditions) of each pod returned by the frontend
deployment's label selector. If even a single pod is ready, it ensures
that the service points to frontend. Otherwise, it waits for a grace
period, checks again, and if downtime persists, it points the service to
the appliance.
This should cover the following cases:
an initial successful install.
exceeds the grace period.
Test plan
Automated tests included that integrate against a real kube-apiserver. We won't know for sure this feature works until we can kick the tires in concert with a few concurrent issues.
Changelog