feat(appliance): add wait for admin state#64042
Conversation
There was a problem hiding this comment.
As the comment mentions, this isn't a long term solution. I think we need to have a more comprehensive health check suite, as well as better state control than just setting state inside endpoint handlers.
There was a problem hiding this comment.
I had to do some refactoring here as the code didn't work
There was a problem hiding this comment.
There is a danger here (and in other handers) that calling these URIs would upset the state and cause a transition to an unwanted state, such as waiting for admin, after install is completed. Happy to chat more about this if needed though.
There was a problem hiding this comment.
Nice! After a successful install and I click the button, I get looped back to the install progress bar, which puts me back to the button. I wonder if we should implement a "maintenance" landing page, and ensure that the user never sees the button again after they click it for the first time.
That would require disambiguating between first-install, waiting-for-admin, and post-install states. The state would never return to a pre-install state again after the first admin ack. This also relates to https://github.com/sourcegraph/sourcegraph/pull/64043#discussion_r1689943300: if the appliance boots and a configmap already exists in a post-install state (regardless of what's healthy), close that channel and let the routing fallback do its thing.
We could even have the reconciler close that channel at the start of its loop if it pulls the configmap and sees it in a post-install state. I know we're drifting from this PR but everything feels a bit intertwined atm.
b125890 to
47ccd10
Compare
|
Chatted a bit about this in pairing. Essentially there can be state before the configMap is created on the initial install of the appliance. Therefore, we need to either track state between to two and sync (probably a bad idea), track "initial install" state on the appliance in memory and use our I've added an initial status of I think there still might be a panic around that channel hiding somewhere as well as this is coming back in the tests: |
|
@jdpleiness I see the problem, this is my bad 🤦. The first reconcile after the one the closes the channel will try to close it again, which causes our panic. I'll fix that in a separate PR. |
|
Fixed in https://github.com/sourcegraph/sourcegraph/pull/64096, let's merge that and rebase this on main. Sorry about that panic! I'll review this PR more generally later today, hopefully. |
|
I think https://github.com/sourcegraph/sourcegraph/pull/64100, a PR into this PR, gets us closer to that elusive goal hole (can't believe this caught on btw) 🙏 See above questions about what closes the issue, but I don't mind if we merge something that doesn't quite work yet, and iterate from there. |
| switch (context.stage) { | ||
| case 'refresh': | ||
| document.location = '/?cacheBust=' + Date.now() | ||
| document.location.reload() |
There was a problem hiding this comment.
Maybe we should be doing this in a loop (not a tight one, use setInterval() or whatever). The user can't be expected to know that we're moving a proxy pointer behind the scenes, so there's no single good time to click a "take me to code search" button.
Refresh loops look janky for sure, but I'm not sure there's a much better way.
It looks weird that we land back on the appliance front page. Maybe we should use state=refresh to take the user to a page that says something like "prepare to be taken to sourcegraph!". WDYT?
|
https://github.com/sourcegraph/sourcegraph/pull/64042/files/b8aab874079bfde5dbab340f8f8d519a4d0a8cac#diff-57a564d9df574eea9118187f70df03077115e1c313d8b64ebb1b7f1cb169c361 is still open, but I think between vacations and time zones, all the branches are getting messy. Let's merge and iterate. https://github.com/sourcegraph/sourcegraph/pull/64100, a follow-on to this, is approved. I'm going to appropriate this PR and merge both of these 🙏 |
- Set initial status to "unknown" to prevent panic on channel closing when status is "" in the configmap initially.
b8aab87 to
435024c
Compare
On every reconcile after the first that sets the configmap to a post-install status. This was recently introduced. Fix by only closing the channel once in the reconciler.
It's a new configmap not the existing one. Use a new golang scope to reflect this.
Unless it doesn't exist, in which case we do not error in order to allow the state to drive the appliance frontend only.
Rather than backend.
Our role definition for the appliance in the helm chart only gives namespaced-scoped permission to lookf for pods, which is all it needs.
craigfurman
left a comment
There was a problem hiding this comment.
Been kicking the tyres locally with this, we're looking good! I've pulled in commits from https://github.com/sourcegraph/sourcegraph/pull/64100 (see previous comment about vacations/handovers). Normal service on small, cohesive PRs will resume shortly 🙏
Add basic health check to switch state to `waitForAdmin` after Sourcegraph frontend is ready. As noted in the code, this is a temporary health check and will/should be replaced with something more comprehensive in the near future. Wait for admin page successfully appears when Sourcegraph frontend is "ready": Co-authored-by: Jacob Pleiness <jdpleiness@users.noreply.github.com> Co-authored-by: Craig Furman <craig.furman@sourcegraph.com>
Add basic health check to switch state to
waitForAdminafter Sourcegraph frontend is ready. As noted in the code, this is a temporary health check and will/should be replaced with something more comprehensive in the near future.Wait for admin page successfully appears when Sourcegraph frontend is "ready":

Test plan
Tested locally
Changelog