feat(appliance): healthchecker manages ingress-facing service by craigfurman · Pull Request #64043 · sourcegraph/sourcegraph-public-snapshot

craigfurman · 2024-07-24T14:37:45Z

Relates to https://linear.app/sourcegraph/issue/REL-78/when-sourcegraph-frontend-is-down-a-user-trying-to-access-sourcegraph but does not close it.

Stacked on https://github.com/sourcegraph/sourcegraph/pull/64032 but can be reviewed independently. Commit log:

Appliance points ingress-facing service to itself by default

Not frontend.

feat(appliance): healthchecker manages ingress-facing service

Add a new background goroutine to the appliance. It does nothing until a
"begin" channel closes. The idea is that another part of the appliance
will close this channel if the configmap state is set to a post-install
value (or on startup if this is already the case when an appliance boots).

After this barrier is lifted, the healthchecker periodically checks the
readiness (using k8s conditions) of each pod returned by the frontend
deployment's label selector. If even a single pod is ready, it ensures
that the service points to frontend. Otherwise, it waits for a grace
period, checks again, and if downtime persists, it points the service to
the appliance.

This should cover the following cases:

- The service is pointed to frontend after the admin clicks "go" after
  an initial successful install.
- The service is pointed to appliance after frontend downtime that
  exceeds the grace period.
- The service is promptly pointed to frontend after downtime ends.

Test plan

Automated tests included that integrate against a real kube-apiserver. We won't know for sure this feature works until we can kick the tires in concert with a few concurrent issues.

Changelog

craigfurman · 2024-07-24T14:38:51Z

Note: cc @jdpleiness. Please see commit message for more context. WDYT of this co-ordination mechanism?

This also relates to the discussion in https://github.com/sourcegraph/sourcegraph/pull/64021#discussion_r1689609233 (cc @Chickensoupwithrice): if we do end up changing reconcileFrontendService() so that it only selectively patches certain object fields, and therefore never clobbers any field not in the JSON patch, this mechanism alone might be enough for the routing fallback to just work! 🪄
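
To illustrate what "selectively patching" could look like, here is a minimal sketch that builds a JSON merge patch touching only `spec.selector`. It is an assumption about one possible shape, not what reconcileFrontendService() does: `selectorPatch` and the `app: sourcegraph-frontend` label are hypothetical, and the thread does not say which patch type would be used.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// selectorPatch builds a JSON merge patch document that sets only
// spec.selector. Fields absent from the patch are left untouched by
// the server when it is applied as a merge patch.
func selectorPatch(selector map[string]string) ([]byte, error) {
	patch := map[string]any{
		"spec": map[string]any{
			"selector": selector,
		},
	}
	return json.Marshal(patch)
}

func main() {
	// Hypothetical label; the real selector comes from the deployment.
	p, err := selectorPatch(map[string]string{"app": "sourcegraph-frontend"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(p))
}
```

One caveat with merge patches: objects are merged key by key, so a selector key that is not mentioned in the patch stays in place. Removing a stale key would require sending it with a `null` value.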

WDYT?

Very nice! I think we could close this after we transition to the "refresh" state after the "wait-for-admin" state where the admin switches over to the frontend instance.

Something like: "wait-for-admin" -> (admin clicks launch UI button) -> "refresh state" -> close channel and start monitoring frontend and transition to "maintenance state" in the background

SGTM! Just had another idea: in order to avoid having to handle 2 cases, 1 for transitioning past the admin button and the other for the appliance booting when the configmap is already in a post-install state, we could close this channel in the reconcile loop if the status field demands it.

IIRC booting a controller does fire off reconcile loops for watched resources that already exist, since new watchers are started.
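
Since a reconcile loop runs many times, closing the channel from it has to be idempotent, because closing an already-closed channel panics in Go. A minimal sketch using `sync.Once`, with hypothetical names (`beginGate`, `reconcile`, `isPostInstall`) and an assumed set of post-install status values:

```go
package main

import (
	"fmt"
	"sync"
)

// beginGate wraps the "begin" channel so it can be opened safely from
// code that runs repeatedly.
type beginGate struct {
	once sync.Once
	ch   chan struct{}
}

func newBeginGate() *beginGate { return &beginGate{ch: make(chan struct{})} }

// Open closes the channel on the first call; later calls are no-ops.
func (g *beginGate) Open() { g.once.Do(func() { close(g.ch) }) }

// Done returns the channel that the healthchecker waits on.
func (g *beginGate) Done() <-chan struct{} { return g.ch }

// isPostInstall is an assumption: which status values count as
// post-install is not spelled out in this thread.
func isPostInstall(status string) bool {
	return status == "refresh" || status == "maintenance"
}

// reconcile stands in for one pass of the reconcile loop.
func reconcile(status string, g *beginGate) {
	if isPostInstall(status) {
		g.Open()
	}
}

func main() {
	g := newBeginGate()
	for _, status := range []string{"wait-for-admin", "refresh", "maintenance"} {
		reconcile(status, g)
		select {
		case <-g.Done():
			fmt.Printf("%s: gate open\n", status)
		default:
			fmt.Printf("%s: gate closed\n", status)
		}
	}
}
```

This handles both cases with one code path: a transition past the admin button and a boot with the configmap already in a post-install state each reach the same `Open` call.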

craigfurman · 2024-07-24T15:08:19Z

worth reading this IMO

Yea cool illustration of using the k8sClient

DaedalusG

This looks good to me. I'll likely reuse bits of the pod-state polling logic for the maintenance splash page.

DaedalusG · 2024-07-24T23:24:11Z

I think this might be a good place for the splash page I'm working on:
https://github.com/sourcegraph/sourcegraph/pull/64019

DaedalusG · 2024-07-25T00:26:59Z

Will likely use something like this in the maintenance splash page

DaedalusG · 2024-07-25T00:30:31Z

I feel like we might want to expose this as an interface that healthChecker implements? Later we'll want to let admins (not regular users) select between Sourcegraph and the Maintenance UI. That's for down the line though.

We can always expose it as public later 👍

DaedalusG · 2024-07-25T00:38:03Z


craigfurman added the no-changelog Exclude this PR from the next changelog. label Jul 24, 2024

cla-bot Bot added the cla-signed label Jul 24, 2024

craigfurman mentioned this pull request Jul 24, 2024

feat(appliance): routes traffic to frontend post-install #64034

Closed

craigfurman commented Jul 24, 2024


craigfurman requested review from a team and DaedalusG and removed request for a team July 24, 2024 14:41

craigfurman mentioned this pull request Jul 24, 2024

feat(appliance): add wait for admin state #64042

Merged

craigfurman commented Jul 24, 2024


Base automatically changed from appliance-expose-status to main July 24, 2024 20:09

DaedalusG approved these changes Jul 25, 2024


Craig Furman added 3 commits July 25, 2024 09:29

Appliance points ingress-facing service to itself by default

bfa22fe

Not frontend.

Reconciler starts healthcheck loop

0d772c3

But we need the appliance to set the status in order to trigger _this_.

craigfurman force-pushed the appliance-healthcheck-ingress-routing branch from 4435e91 to 0d772c3 Compare July 25, 2024 08:30

craigfurman marked this pull request as ready for review July 25, 2024 13:30

craigfurman merged commit 255e638 into main Jul 25, 2024

craigfurman deleted the appliance-healthcheck-ingress-routing branch July 25, 2024 13:40
