Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

feat(appliance): healthchecker manages ingress-facing service#64043

Merged
craigfurman merged 3 commits into
mainfrom
appliance-healthcheck-ingress-routing
Jul 25, 2024
Merged

feat(appliance): healthchecker manages ingress-facing service#64043
craigfurman merged 3 commits into
mainfrom
appliance-healthcheck-ingress-routing

Conversation

@craigfurman

@craigfurman craigfurman commented Jul 24, 2024

Copy link
Copy Markdown
Contributor

Relates to https://linear.app/sourcegraph/issue/REL-78/when-sourcegraph-frontend-is-down-a-user-trying-to-access-sourcegraph but does not close it.

Stacked on https://github.com/sourcegraph/sourcegraph/pull/64032 but can be reviewed independently. Commit log:

Appliance points ingress-facing service to itself by default

Not frontend.

feat(appliance): healthchecker manages ingress-facing service

Add a new background goroutine to the appliance. It does nothing until a
"begin" channel closes. The idea is that another part of the appliance
will close this channel if the configmap state is set to a post-install
value (or on startup if this is already the case when an appliance boots).

After this barrier is lifted, the healtchecker periodically checks the
readiness (using k8s conditions) of each pod returned by the frontend
deployment's label selector. If even a single pod is ready, it ensures
that the service points to frontend. Otherwise, it waits for a grace
period, checks again, and if downtime persists, it points the service to
the appliance.

This should cover the following cases:

  • The service is pointed to frontend after the admin clicks "go" after
    an initial successful install.
  • The service is pointed to appliance after frontend downtime that
    exceeds the grace period.
  • The service is promptly pointed to frontend after downtime ends.

Test plan

Automated tests included that integrate against a real kube-apiserver. We won't know for sure this feature works until we can kick the tires in concert with a few concurrent issues.

Changelog

@craigfurman craigfurman added the no-changelog Exclude this PR from the next changelog. label Jul 24, 2024
@cla-bot cla-bot Bot added the cla-signed label Jul 24, 2024
Comment thread cmd/appliance/shared/shared.go Outdated
Comment on lines 108 to 112

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: cc @jdpleiness. Please see commit message for more context. WDYT of this co-ordination mechanism?

This also relates to the discussion in https://github.com/sourcegraph/sourcegraph/pull/64021#discussion_r1689609233 (cc @Chickensoupwithrice): if we do end up changing reconcileFrontendService() so that it only selectively patches certain object fields, and therefore never clobber any field not in the json patch, this mechanism alone might be enough for the routing fallback to just work! 🪄

WDYT?

@jdpleiness jdpleiness Jul 24, 2024

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! I think we could close this after we transition to the "refresh" state after the "wait-for-admin" state where the admin switches over to the frontend instance.

Something like: "wait-for-admin" -> (admin clicks launch UI button) -> "refresh state" -> close channel and start monitoring frontend and transition to "maintenance state" in the background

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM! Just had another idea: in order to avoid having to handle 2 cases, 1 for transitioning past the admin button and the other for the appliance booting when the configmap is already in a post-install state, we could close this channel in the reconcile loop if the status field demands it.

IIRC booting a controller does fire off reconcile loops for watched resources that already exist, since new watchers are started.

@craigfurman craigfurman requested review from a team and DaedalusG and removed request for a team July 24, 2024 14:41

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth reading this IMO

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea cool illustration of using the k8sClient

Base automatically changed from appliance-expose-status to main July 24, 2024 20:09

@DaedalusG DaedalusG left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, I'll likely reuse bits of the logic for polling the states of pods for the maintenance splash page

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be a good place for the splash page Im working on:
https://github.com/sourcegraph/sourcegraph/pull/64019
Screenshot 2024-07-24 at 12 00 44 AM

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will likely use something like this in the maintenance splash page

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we might want to get this as an interface healthChecker implements? Just because later we'll want to be able to allow the user to select Sourcegraph or Maintenance UI. Not for regular users but for admins. Thats for down the line though.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can always expose it as public later 👍

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea cool illustration of using the k8sClient

Craig Furman added 3 commits July 25, 2024 09:29
Add a new background goroutine to the appliance. It does nothing until a
"begin" channel closes. The idea is that another part of the appliance
will close this channel if the configmap state is set to a post-install
value (or on startup if this is already the case when an appliance boots).

After this barrier is lifted, the healtchecker periodically checks the
readiness (using k8s conditions) of each pod returned by the frontend
deployment's label selector. If even a single pod is ready, it ensures
that the service points to frontend. Otherwise, it waits for a grace
period, checks again, and if downtime persists, it points the service to
the appliance.

This should cover the following cases:
- The service is pointed to frontend after the admin clicks "go" after
  an intial successful install.
- The service is pointed to appliance after frontend downtime that
  exceeds the grace period.
- The servie is promptly pointed to frontend after downtime ends.
But we need the appliance to set the status in order to trigger _this_.
@craigfurman craigfurman force-pushed the appliance-healthcheck-ingress-routing branch from 4435e91 to 0d772c3 Compare July 25, 2024 08:30
@craigfurman craigfurman marked this pull request as ready for review July 25, 2024 13:30
@craigfurman craigfurman merged commit 255e638 into main Jul 25, 2024
@craigfurman craigfurman deleted the appliance-healthcheck-ingress-routing branch July 25, 2024 13:40
craigfurman pushed a commit that referenced this pull request Jul 31, 2024
**Appliance points ingress-facing service to itself by default**

Not frontend.

**feat(appliance): healthchecker manages ingress-facing service**

Add a new background goroutine to the appliance. It does nothing until a
"begin" channel closes. The idea is that another part of the appliance
will close this channel if the configmap state is set to a post-install
value (or on startup if this is already the case when an appliance
boots).

After this barrier is lifted, the healtchecker periodically checks the
readiness (using k8s conditions) of each pod returned by the frontend
deployment's label selector. If even a single pod is ready, it ensures
that the service points to frontend. Otherwise, it waits for a grace
period, checks again, and if downtime persists, it points the service to
the appliance.

This should cover the following cases:
- The service is pointed to frontend after the admin clicks "go" after
  an initial successful install.
- The service is pointed to appliance after frontend downtime that
  exceeds the grace period.
- The service is promptly pointed to frontend after downtime ends.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

cla-signed no-changelog Exclude this PR from the next changelog.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants