feat(appliance): add wait for admin state by jdpleiness · Pull Request #64042 · sourcegraph/sourcegraph-public-snapshot

jdpleiness · 2024-07-24T14:33:11Z

Add basic health check to switch state to waitForAdmin after Sourcegraph frontend is ready. As noted in the code, this is a temporary health check and will/should be replaced with something more comprehensive in the near future.

Wait for admin page successfully appears when Sourcegraph frontend is "ready":

Test plan

Tested locally

Changelog

jdpleiness · 2024-07-24T14:37:04Z

As the comment mentions, this isn't a long term solution. I think we need to have a more comprehensive health check suite, as well as better state control than just setting state inside endpoint handlers.

jdpleiness · 2024-07-24T14:37:26Z

I had to do some refactoring here as the code didn't work

jdpleiness · 2024-07-24T14:38:59Z

There is a danger here (and in other handers) that calling these URIs would upset the state and cause a transition to an unwanted state, such as waiting for admin, after install is completed. Happy to chat more about this if needed though.

craigfurman

Nice! After a successful install and I click the button, I get looped back to the install progress bar, which puts me back to the button. I wonder if we should implement a "maintenance" landing page, and ensure that the user never sees the button again after they click it for the first time.

That would require disambiguating between first-install, waiting-for-admin, and post-install states. The state would never return to a pre-install state again after the first admin ack. This also relates to https://github.com/sourcegraph/sourcegraph/pull/64043#discussion_r1689943300: if the appliance boots and a configmap already exists in a post-install state (regardless of what's healthy), close that channel and let the routing fallback do its thing.

We could even have the reconciler close that channel at the start of its loop if it pulls the configmap and sees it in a post-install state. I know we're drifting from this PR but everything feels a bit intertwined atm.

jdpleiness · 2024-07-25T20:42:18Z

Chatted a bit about this in pairing. Essentially there can be state before the configMap is created on the initial install of the appliance. Therefore, we need to either track state between to two and sync (probably a bad idea), track "initial install" state on the appliance in memory and use our IsPostInstallStatus func to handle switching from one source in the appliance to the source of truth in the configMap (likely better), or keep it all in memory all the time and we determine state idempotently on appliance startup (my favorite but harder to implement).

I've added an initial status of unknown to the configMap as otherwise we get a panic when the health check channel is closed. I think what's left is to implement the handling of status between pre-install and post-install. @craigfurman I wasn't quite able to put this in the goal hole before signing off 🙏 ⚽ , and I didn't want to make this even worse by half implementing that portion of the logic. Feel free to change whatever you need to get this going.

I think there still might be a panic around that channel hiding somewhere as well as this is coming back in the tests:

E0725 19:40:50.445273      73 controller.go:159] unable to sync kubernetes service: Endpoints "kubernetes" is invalid: subsets[0].addresses[0].ip: Invalid value: "127.0.0.1": may not be in the loopback range (127.0.0.0/8, ::1/128)
I0725 19:40:50.501617      73 alloc.go:330] "allocated clusterIPs" service="test-appliance-20dcb2/sourcegraph-frontend" clusterIPs={"IPv4":"10.0.0.103"}
I0725 19:40:50.509119      73 controller.go:624] quota admission added evaluator for: ingresses.networking.k8s.io
panic: close of nil channel [recovered]
	panic: close of nil channel

goroutine 237 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	external/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:116 +0x2dd
panic({0x2a626a0?, 0x30c5c20?})
	GOROOT/src/runtime/panic.go:770 +0x132
github.com/sourcegraph/sourcegraph/internal/appliance/reconciler.(*Reconciler).Reconcile(0xc000900d50, {0x30e5c48, 0xc0007bab40}, {{{0xc0005de018?, 0x15?}, {0xc0003ae2ce?, 0x2?}}})
	internal/appliance/reconciler/reconcile.go:56 +0x3fb
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc0000e77c0, {0x30e5c48, 0xc0007bab40}, {{{0xc0005de018?, 0xb?}, {0xc0003ae2ce?, 0x0?}}})
	external/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:119 +0x1a2
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0000e77c0, {0x30e5c80, 0xc0003b18b0}, {0x2b1c9c0, 0xc000251120})
	external/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:316 +0x59b
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0000e77c0, {0x30e5c80, 0xc0003b18b0})
	external/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:266 +0x339
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	external/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:227 +0xb3
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 165
	external/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:223 +0x80c

craigfurman · 2024-07-26T08:20:01Z

@jdpleiness I see the problem, this is my bad 🤦. The first reconcile after the one the closes the channel will try to close it again, which causes our panic. I'll fix that in a separate PR.

craigfurman · 2024-07-26T08:49:49Z

Fixed in https://github.com/sourcegraph/sourcegraph/pull/64096, let's merge that and rebase this on main. Sorry about that panic!

I'll review this PR more generally later today, hopefully.

craigfurman · 2024-07-26T11:07:41Z

I think https://github.com/sourcegraph/sourcegraph/pull/64100, a PR into this PR, gets us closer to that elusive goal hole (can't believe this caught on btw) 🙏

See above questions about what closes the issue, but I don't mind if we merge something that doesn't quite work yet, and iterate from there.

craigfurman · 2024-07-26T15:58:24Z

    switch (context.stage) {
        case 'refresh':
-            document.location = '/?cacheBust=' + Date.now()
+            document.location.reload()


Maybe we should be doing this in a loop (not a tight one, use setInterval() or whatever). The user can't be expected to know that we're moving a proxy pointer behind the scenes, so there's no single good time to click a "take me to code search" button.

Refresh loops look janky for sure, but I'm not sure there's a much better way.

It looks weird that we land back on the appliance front page. Maybe we should use state=refresh to take the user to a page that says something like "prepare to be taken to sourcegraph!". WDYT?

craigfurman · 2024-07-29T08:10:41Z

https://github.com/sourcegraph/sourcegraph/pull/64042/files/b8aab874079bfde5dbab340f8f8d519a4d0a8cac#diff-57a564d9df574eea9118187f70df03077115e1c313d8b64ebb1b7f1cb169c361 is still open, but I think between vacations and time zones, all the branches are getting messy. Let's merge and iterate. https://github.com/sourcegraph/sourcegraph/pull/64100, a follow-on to this, is approved. I'm going to appropriate this PR and merge both of these 🙏

- Set initial status to "unknown" to prevent panic on channel closing when status is "" in the configmap initially.

On every reconcile after the first that sets the configmap to a post-install status. This was recently introduced. Fix by only closing the channel once in the reconciler.

It's a new configmap not the existing one. Use a new golang scope to reflect this.

Unless it doesn't exist, in which case we do not error in order to allow the state to drive the appliance frontend only.

Rather than backend.

Our role definition for the appliance in the helm chart only gives namespaced-scoped permission to lookf for pods, which is all it needs.

craigfurman

Been kicking the tyres locally with this, we're looking good! I've pulled in commits from https://github.com/sourcegraph/sourcegraph/pull/64100 (see previous comment about vacations/handovers). Normal service on small, cohesive PRs will resume shortly 🙏

Add basic health check to switch state to `waitForAdmin` after Sourcegraph frontend is ready. As noted in the code, this is a temporary health check and will/should be replaced with something more comprehensive in the near future. Wait for admin page successfully appears when Sourcegraph frontend is "ready": Co-authored-by: Jacob Pleiness <jdpleiness@users.noreply.github.com> Co-authored-by: Craig Furman <craig.furman@sourcegraph.com>

jdpleiness added the no-changelog Exclude this PR from the next changelog. label Jul 24, 2024

cla-bot Bot added the cla-signed label Jul 24, 2024

jdpleiness requested a review from craigfurman July 24, 2024 14:34

jdpleiness commented Jul 24, 2024

View reviewed changes

craigfurman reviewed Jul 24, 2024

View reviewed changes

jdpleiness force-pushed the jdp/appliance/07-23-appliance-wait-for-admin branch from b125890 to 47ccd10 Compare July 25, 2024 19:00

craigfurman mentioned this pull request Jul 26, 2024

fix(appliance): transpose state to configmap when possible #64100

Closed

craigfurman reviewed Jul 26, 2024

View reviewed changes

craigfurman self-assigned this Jul 29, 2024

jdpleiness added 4 commits July 29, 2024 09:43

add healthcheck and control wfa state

8461e10

fix formatting

048f2be

resolve merge conflicts

d09fe06

Set initial appliance cfgMap status

435024c

- Set initial status to "unknown" to prevent panic on channel closing when status is "" in the configmap initially.

craigfurman force-pushed the jdp/appliance/07-23-appliance-wait-for-admin branch from b8aab87 to 435024c Compare July 29, 2024 08:43

Craig Furman added 5 commits July 29, 2024 09:44

fix(appliance): panic on nil channel

70a35a9

On every reconcile after the first that sets the configmap to a post-install status. This was recently introduced. Fix by only closing the channel once in the reconciler.

Make a variable name less confusing

a4edfed

It's a new configmap not the existing one. Use a new golang scope to reflect this.

Set configmap status on every state transition

1e0bde8

Unless it doesn't exist, in which case we do not error in order to allow the state to drive the appliance frontend only.

fix(appliance): point ingress-facing service at frontend

dc727cb

Rather than backend.

fix(appliance): healthchecker should search for pods in 1 namespace

f23c534

Our role definition for the appliance in the helm chart only gives namespaced-scoped permission to lookf for pods, which is all it needs.

craigfurman marked this pull request as ready for review July 29, 2024 09:45

craigfurman approved these changes Jul 29, 2024

View reviewed changes

craigfurman merged commit 7e82c27 into main Jul 29, 2024

craigfurman deleted the jdp/appliance/07-23-appliance-wait-for-admin branch July 29, 2024 10:21

craigfurman mentioned this pull request Jul 29, 2024

fix(appliance): panic on nil channel #64096

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(appliance): add wait for admin state#64042

feat(appliance): add wait for admin state#64042
craigfurman merged 9 commits into
mainfrom
jdp/appliance/07-23-appliance-wait-for-admin

jdpleiness commented Jul 24, 2024 •

edited

Loading

Uh oh!

jdpleiness Jul 24, 2024

Uh oh!

jdpleiness Jul 24, 2024

Uh oh!

jdpleiness Jul 24, 2024

Uh oh!

craigfurman left a comment •

edited

Loading

Uh oh!

jdpleiness commented Jul 25, 2024

Uh oh!

craigfurman commented Jul 26, 2024

Uh oh!

craigfurman commented Jul 26, 2024

Uh oh!

craigfurman commented Jul 26, 2024 •

edited

Loading

Uh oh!

craigfurman Jul 26, 2024

Uh oh!

craigfurman commented Jul 29, 2024

Uh oh!

craigfurman left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jdpleiness commented Jul 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test plan

Changelog

Uh oh!

jdpleiness Jul 24, 2024

Choose a reason for hiding this comment

Uh oh!

jdpleiness Jul 24, 2024

Choose a reason for hiding this comment

Uh oh!

jdpleiness Jul 24, 2024

Choose a reason for hiding this comment

Uh oh!

craigfurman left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jdpleiness commented Jul 25, 2024

Uh oh!

craigfurman commented Jul 26, 2024

Uh oh!

craigfurman commented Jul 26, 2024

Uh oh!

craigfurman commented Jul 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

craigfurman Jul 26, 2024

Choose a reason for hiding this comment

Uh oh!

craigfurman commented Jul 29, 2024

Uh oh!

craigfurman left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jdpleiness commented Jul 24, 2024 •

edited

Loading

craigfurman left a comment •

edited

Loading

craigfurman commented Jul 26, 2024 •

edited

Loading