Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

fix(appliance): cache authorization status#64213

Merged
craigfurman merged 1 commit into
mainfrom
appliance-cache-authz
Aug 1, 2024
Merged

fix(appliance): cache authorization status#64213
craigfurman merged 1 commit into
mainfrom
appliance-cache-authz

Conversation

@craigfurman

@craigfurman craigfurman commented Aug 1, 2024

Copy link
Copy Markdown
Contributor

In order to reduce the cost of calls to auth-gated endpoints, cache valid admin passwords in-memory. The appliance's frontend calls auth-gated endpoints in a tight loop, and bcrypt checking is intentionally an expensive operation.

This could occasionally cause the appliance-frontend to disconnect from the backend. We observed frontend's nginx reporting an upstream connection close, and exec'ing into its pod and curling the backend regularly hung.

I collected two CPU profiles from the appliance backend: one without this commit, one with it. In both cases, SG was not being installed - the frontend was running, and I had a browser tab open, so that the browser was hitting the backend frequently via the nginx API proxy.

Without this fix:

appliance-cpu-main

With this fix:

appliance-cpu

See the test plan below for how I obtained these CPU profiles.

2 things stand out between them: without this fix, the total CPU time consumed over the 30-second profiling period is 1000s of times larger! On my mac (so not even contending with other processes on a kubernetes node), it used 25 seconds of CPU time - almost saturating a core. We can also see that calls to bcrypt.CompareHashAndPassword() are responsible for all of this.

It's perhaps not ideal from a security perspective to memory-cache the password, but subjectively this trade-off seems like a reasonable way to get moving. Let me know what you think though.

This is a necessary step for https://linear.app/sourcegraph/issue/REL-308/appliance-frontend-seems-to-disconnect-the-backend-during-installation but does not close it. This is because the disconnection bug still occurs, after clicking wait-for-admin, but I think this instance of it is for a different reason. See https://github.com/sourcegraph/sourcegraph/pull/64216 for an explanation and fix of that reason.

Test plan

Starting on the https://github.com/sourcegraph/sourcegraph/pull/64211 branch, not this one:

In one terminal:

export APPLIANCE_PPROF_ADDR=localhost:6061
go run ./cmd/appliance

In another:

cd internal/appliance/frontend/maintenance
pnpm run dev

Navigate to localhost:8889 in a web browser and log into the appliance. You don't need to begin installing SG, just leave the tab open.

In another terminal: go tool pprof -png -output appliance-cpu-main.png 'http://localhost:6061/debug/pprof/profile?seconds=30

Repeat the experiment for this branch (but with https://github.com/sourcegraph/sourcegraph/pull/64211 merged into it, for pprof), and compare profiles.

Finally, I deployed this branch to my local minikube environment to see how it interacted with the ingress stack:

eval $(minikube -p minikube docker-env)
docker rmi appliance:candidate; sg bazel run //cmd/appliance:image_tarball
docker tag appliance:candidate index.docker.io/sourcegraph/appliance:candidate

# cd to the helm chart repo
helm upgrade --install --namespace test \
  --set noResourceRestrictions=true --set image.tag=candidate --set selfUpdate.enabled=false \
  appliance ./charts/sourcegraph-appliance

I saw no disconnections during SG's installation, until the race condition I described further up kicked in after clicking wait-for-admin, and I had to refresh the page in order to see site-admin.

Changelog

In order to reduce the cost of calls to auth-gated endpoints, cache
valid admin passwords in-memory. The appliance's frontend calls
auth-gated endpoints in a tight loop, and bcrypt checking is
intentionally an expensive operation.

This could occasionally cause the appliance-frontend to disconnect from
the backend. We observed frontend's nginx reporting an upstream
connection close, and exec'ing into its pod and curling the backend
regularly hung.
@craigfurman craigfurman requested review from a team and Chickensoupwithrice and removed request for a team August 1, 2024 11:41
@cla-bot cla-bot Bot added the cla-signed label Aug 1, 2024
craigfurman referenced this pull request Aug 1, 2024
When the admin has first installed Code Search Suite, the appliance
waits for the admin to click an "I'm ready" button. This causes the
appliance to unblock a background thread that periodically checks the
health of sg-frontend. When it is healthy, it ensures that the
ingress-facing frontend is pointed to sg-frontend. And when it is not,
it points to the appliance-frontend. Pointing to the appliance-frontend
is its initial state pre-install, and given that we've just installed
sg, the appliance switches the service over quickly.

Meanwhile, clicking this button transitions the frontend to a "refresh"
state (this being one of the states in its state machine). This causes
the UI to reload the web page. The reason we have to do this is that it
is a way to "redirect to yourself". If the ingress-facing service has
been repointed, refreshing like this will show site-admin, which is the
desired behavior. The issue this commit fixes, is that this is racy:
upon refresh, the browser tab queries the appliance (via an nginx proxy
hosted on the same domain serving appliance-frontend) for its state. We
have to store state on the backend (specifically, we use a ConfigMap
annotation), so that the appliance can do the right thing if it has been
rebooted at any time. This will help power future features such as
UI-driven upgrades. The race occurs if, upon refresh, the ingress-facing
service has been flipped over to sg-frontend. The appliance API that
answered the state questions is no longer available!

In general, we can't tell the difference between this expected turn of
events, and a state in which the backend can't be reached. This commit
mitigates the race by setting the appliance UI to refresh if it cannot
reach the appliance API. This looks no different to a "disconnected"
state if things really are broken, but in the expected path, it will
resolve the race by retrying.

This commit reliably causes the appliance-driven installation flow to
redirect to site-admin after clicking "ready", according to my
experimentation in minikube. I suspect that this would be the case even
without https://github.com/sourcegraph/sourcegraph/pull/64213, which
fixes an unrelated performance issue. I suspect we need both, otherwise
the appliance UI will regularly disconnect for prolonged periods of
time, which is confusing.
@craigfurman craigfurman requested a review from willdollman August 1, 2024 16:13
craigfurman referenced this pull request Aug 1, 2024
When the admin has first installed Code Search Suite, the appliance
waits for the admin to click an "I'm ready" button. This causes the
appliance to unblock a background thread that periodically checks the
health of sg-frontend. When it is healthy, it ensures that the
ingress-facing frontend is pointed to sg-frontend. And when it is not,
it points to the appliance-frontend. Pointing to the appliance-frontend
is its initial state pre-install, and given that we've just installed
sg, the appliance switches the service over quickly.

Meanwhile, clicking this button transitions the frontend to a "refresh"
state (this being one of the states in its state machine). This causes
the UI to reload the web page. The reason we have to do this is that it
is a way to "redirect to yourself". If the ingress-facing service has
been repointed, refreshing like this will show site-admin, which is the
desired behavior. The issue this commit fixes, is that this is racy:
upon refresh, the browser tab queries the appliance (via an nginx proxy
hosted on the same domain serving appliance-frontend) for its state. We
have to store state on the backend (specifically, we use a ConfigMap
annotation), so that the appliance can do the right thing if it has been
rebooted at any time. This will help power future features such as
UI-driven upgrades. The race occurs if, upon refresh, the ingress-facing
service has been flipped over to sg-frontend. The appliance API that
answered the state questions is no longer available!

In general, we can't tell the difference between this expected turn of
events, and a state in which the backend can't be reached. This commit
mitigates the race by setting the appliance UI to refresh if it cannot
reach the appliance API. This looks no different to a "disconnected"
state if things really are broken, but in the expected path, it will
resolve the race by retrying.

This commit reliably causes the appliance-driven installation flow to
redirect to site-admin after clicking "ready", according to my
experimentation in minikube. I suspect that this would be the case even
without https://github.com/sourcegraph/sourcegraph/pull/64213, which
fixes an unrelated performance issue. I suspect we need both, otherwise
the appliance UI will regularly disconnect for prolonged periods of
time, which is confusing.

Closes
https://linear.app/sourcegraph/issue/REL-308/appliance-frontend-seems-to-disconnect-the-backend-during-installation
@craigfurman craigfurman merged commit 156aa5a into main Aug 1, 2024
@craigfurman craigfurman deleted the appliance-cache-authz branch August 1, 2024 16:38
sourcegraph-release-bot pushed a commit that referenced this pull request Aug 1, 2024
In order to reduce the cost of calls to auth-gated endpoints, cache
valid admin passwords in-memory. The appliance's frontend calls
auth-gated endpoints in a tight loop, and bcrypt checking is
intentionally an expensive operation.

This could occasionally cause the appliance-frontend to disconnect from
the backend. We observed frontend's nginx reporting an upstream
connection close, and exec'ing into its pod and curling the backend
regularly hung.

(cherry picked from commit 156aa5a)
craigfurman pushed a commit that referenced this pull request Aug 1, 2024
Backport 156aa5a from #64213

Co-authored-by: Craig Furman <craig.furman@sourcegraph.com>
craigfurman referenced this pull request Aug 1, 2024
When the admin has first installed Code Search Suite, the appliance
waits for the admin to click an "I'm ready" button. This causes the
appliance to unblock a background thread that periodically checks the
health of sg-frontend. When it is healthy, it ensures that the
ingress-facing frontend is pointed to sg-frontend. And when it is not,
it points to the appliance-frontend. Pointing to the appliance-frontend
is its initial state pre-install, and given that we've just installed
sg, the appliance switches the service over quickly.

Meanwhile, clicking this button transitions the frontend to a "refresh"
state (this being one of the states in its state machine). This causes
the UI to reload the web page. The reason we have to do this is that it
is a way to "redirect to yourself". If the ingress-facing service has
been repointed, refreshing like this will show site-admin, which is the
desired behavior. The issue this commit fixes, is that this is racy:
upon refresh, the browser tab queries the appliance (via an nginx proxy
hosted on the same domain serving appliance-frontend) for its state. We
have to store state on the backend (specifically, we use a ConfigMap
annotation), so that the appliance can do the right thing if it has been
rebooted at any time. This will help power future features such as
UI-driven upgrades. The race occurs if, upon refresh, the ingress-facing
service has been flipped over to sg-frontend. The appliance API that
answered the state questions is no longer available!

In general, we can't tell the difference between this expected turn of
events, and a state in which the backend can't be reached. This commit
mitigates the race by setting the appliance UI to refresh if it cannot
reach the appliance API. This looks no different to a "disconnected"
state if things really are broken, but in the expected path, it will
resolve the race by retrying.

This commit reliably causes the appliance-driven installation flow to
redirect to site-admin after clicking "ready", according to my
experimentation in minikube. I suspect that this would be the case even
without https://github.com/sourcegraph/sourcegraph/pull/64213, which
fixes an unrelated performance issue. I suspect we need both, otherwise
the appliance UI will regularly disconnect for prolonged periods of
time, which is confusing.

Closes
https://linear.app/sourcegraph/issue/REL-308/appliance-frontend-seems-to-disconnect-the-backend-during-installation

(cherry picked from commit e54407d)

@evict evict left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants