Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

feat(appliance): add wait for admin state#64042

Merged
craigfurman merged 9 commits into
mainfrom
jdp/appliance/07-23-appliance-wait-for-admin
Jul 29, 2024
Merged

feat(appliance): add wait for admin state#64042
craigfurman merged 9 commits into
mainfrom
jdp/appliance/07-23-appliance-wait-for-admin

Conversation

@jdpleiness

@jdpleiness jdpleiness commented Jul 24, 2024

Copy link
Copy Markdown
Contributor

Add basic health check to switch state to waitForAdmin after Sourcegraph frontend is ready. As noted in the code, this is a temporary health check and will/should be replaced with something more comprehensive in the near future.

Wait for admin page successfully appears when Sourcegraph frontend is "ready":
image

Test plan

Tested locally

Changelog

@jdpleiness jdpleiness added the no-changelog Exclude this PR from the next changelog. label Jul 24, 2024
@cla-bot cla-bot Bot added the cla-signed label Jul 24, 2024
@jdpleiness jdpleiness requested a review from craigfurman July 24, 2024 14:34
Comment thread internal/appliance/appliance.go Outdated
Comment on lines 193 to 206

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the comment mentions, this isn't a long term solution. I think we need to have a more comprehensive health check suite, as well as better state control than just setting state inside endpoint handlers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to do some refactoring here as the code didn't work

Comment thread internal/appliance/json.go Outdated

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a danger here (and in other handers) that calling these URIs would upset the state and cause a transition to an unwanted state, such as waiting for admin, after install is completed. Happy to chat more about this if needed though.

@craigfurman craigfurman left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! After a successful install and I click the button, I get looped back to the install progress bar, which puts me back to the button. I wonder if we should implement a "maintenance" landing page, and ensure that the user never sees the button again after they click it for the first time.

That would require disambiguating between first-install, waiting-for-admin, and post-install states. The state would never return to a pre-install state again after the first admin ack. This also relates to https://github.com/sourcegraph/sourcegraph/pull/64043#discussion_r1689943300: if the appliance boots and a configmap already exists in a post-install state (regardless of what's healthy), close that channel and let the routing fallback do its thing.

We could even have the reconciler close that channel at the start of its loop if it pulls the configmap and sees it in a post-install state. I know we're drifting from this PR but everything feels a bit intertwined atm.

@jdpleiness jdpleiness force-pushed the jdp/appliance/07-23-appliance-wait-for-admin branch from b125890 to 47ccd10 Compare July 25, 2024 19:00
@jdpleiness

Copy link
Copy Markdown
Contributor Author

Chatted a bit about this in pairing. Essentially there can be state before the configMap is created on the initial install of the appliance. Therefore, we need to either track state between to two and sync (probably a bad idea), track "initial install" state on the appliance in memory and use our IsPostInstallStatus func to handle switching from one source in the appliance to the source of truth in the configMap (likely better), or keep it all in memory all the time and we determine state idempotently on appliance startup (my favorite but harder to implement).

I've added an initial status of unknown to the configMap as otherwise we get a panic when the health check channel is closed. I think what's left is to implement the handling of status between pre-install and post-install. @craigfurman I wasn't quite able to put this in the goal hole before signing off 🙏 ⚽ , and I didn't want to make this even worse by half implementing that portion of the logic. Feel free to change whatever you need to get this going.

I think there still might be a panic around that channel hiding somewhere as well as this is coming back in the tests:

E0725 19:40:50.445273      73 controller.go:159] unable to sync kubernetes service: Endpoints "kubernetes" is invalid: subsets[0].addresses[0].ip: Invalid value: "127.0.0.1": may not be in the loopback range (127.0.0.0/8, ::1/128)
I0725 19:40:50.501617      73 alloc.go:330] "allocated clusterIPs" service="test-appliance-20dcb2/sourcegraph-frontend" clusterIPs={"IPv4":"10.0.0.103"}
I0725 19:40:50.509119      73 controller.go:624] quota admission added evaluator for: ingresses.networking.k8s.io
panic: close of nil channel [recovered]
	panic: close of nil channel

goroutine 237 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	external/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:116 +0x2dd
panic({0x2a626a0?, 0x30c5c20?})
	GOROOT/src/runtime/panic.go:770 +0x132
github.com/sourcegraph/sourcegraph/internal/appliance/reconciler.(*Reconciler).Reconcile(0xc000900d50, {0x30e5c48, 0xc0007bab40}, {{{0xc0005de018?, 0x15?}, {0xc0003ae2ce?, 0x2?}}})
	internal/appliance/reconciler/reconcile.go:56 +0x3fb
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc0000e77c0, {0x30e5c48, 0xc0007bab40}, {{{0xc0005de018?, 0xb?}, {0xc0003ae2ce?, 0x0?}}})
	external/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:119 +0x1a2
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0000e77c0, {0x30e5c80, 0xc0003b18b0}, {0x2b1c9c0, 0xc000251120})
	external/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:316 +0x59b
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0000e77c0, {0x30e5c80, 0xc0003b18b0})
	external/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:266 +0x339
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	external/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:227 +0xb3
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 165
	external/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:223 +0x80c

@craigfurman

Copy link
Copy Markdown
Contributor

@jdpleiness I see the problem, this is my bad 🤦. The first reconcile after the one the closes the channel will try to close it again, which causes our panic. I'll fix that in a separate PR.

@craigfurman

Copy link
Copy Markdown
Contributor

Fixed in https://github.com/sourcegraph/sourcegraph/pull/64096, let's merge that and rebase this on main. Sorry about that panic!

I'll review this PR more generally later today, hopefully.

@craigfurman

craigfurman commented Jul 26, 2024

Copy link
Copy Markdown
Contributor

I think https://github.com/sourcegraph/sourcegraph/pull/64100, a PR into this PR, gets us closer to that elusive goal hole (can't believe this caught on btw) 🙏

See above questions about what closes the issue, but I don't mind if we merge something that doesn't quite work yet, and iterate from there.

switch (context.stage) {
case 'refresh':
document.location = '/?cacheBust=' + Date.now()
document.location.reload()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should be doing this in a loop (not a tight one, use setInterval() or whatever). The user can't be expected to know that we're moving a proxy pointer behind the scenes, so there's no single good time to click a "take me to code search" button.

Refresh loops look janky for sure, but I'm not sure there's a much better way.

It looks weird that we land back on the appliance front page. Maybe we should use state=refresh to take the user to a page that says something like "prepare to be taken to sourcegraph!". WDYT?

@craigfurman craigfurman self-assigned this Jul 29, 2024
@craigfurman

Copy link
Copy Markdown
Contributor

https://github.com/sourcegraph/sourcegraph/pull/64042/files/b8aab874079bfde5dbab340f8f8d519a4d0a8cac#diff-57a564d9df574eea9118187f70df03077115e1c313d8b64ebb1b7f1cb169c361 is still open, but I think between vacations and time zones, all the branches are getting messy. Let's merge and iterate. https://github.com/sourcegraph/sourcegraph/pull/64100, a follow-on to this, is approved. I'm going to appropriate this PR and merge both of these 🙏

- Set initial status to "unknown" to prevent panic on channel closing when status is "" in the configmap initially.
@craigfurman craigfurman force-pushed the jdp/appliance/07-23-appliance-wait-for-admin branch from b8aab87 to 435024c Compare July 29, 2024 08:43
Craig Furman added 5 commits July 29, 2024 09:44
On every reconcile after the first that sets the configmap to a
post-install status. This was recently introduced.

Fix by only closing the channel once in the reconciler.
It's a new configmap not the existing one. Use a new golang scope to
reflect this.
Unless it doesn't exist, in which case we do not error in order to allow
the state to drive the appliance frontend only.
Our role definition for the appliance in the helm chart only gives
namespaced-scoped permission to lookf for pods, which is all it needs.
@craigfurman craigfurman marked this pull request as ready for review July 29, 2024 09:45

@craigfurman craigfurman left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Been kicking the tyres locally with this, we're looking good! I've pulled in commits from https://github.com/sourcegraph/sourcegraph/pull/64100 (see previous comment about vacations/handovers). Normal service on small, cohesive PRs will resume shortly 🙏

@craigfurman craigfurman merged commit 7e82c27 into main Jul 29, 2024
@craigfurman craigfurman deleted the jdp/appliance/07-23-appliance-wait-for-admin branch July 29, 2024 10:21
craigfurman pushed a commit that referenced this pull request Jul 31, 2024
Add basic health check to switch state to `waitForAdmin` after
Sourcegraph frontend is ready. As noted in the code, this is a temporary
health check and will/should be replaced with something more
comprehensive in the near future.

Wait for admin page successfully appears when Sourcegraph frontend is
"ready":

Co-authored-by: Jacob Pleiness <jdpleiness@users.noreply.github.com>
Co-authored-by: Craig Furman <craig.furman@sourcegraph.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

cla-signed no-changelog Exclude this PR from the next changelog.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants