[Backport 5.5.x] fix(appliance): cache authorization status by sourcegraph-release-bot · Pull Request #64219 · sourcegraph/sourcegraph-public-snapshot

sourcegraph-release-bot · 2024-08-01T16:42:12Z

In order to reduce the cost of calls to auth-gated endpoints, cache valid admin passwords in-memory. The appliance's frontend calls auth-gated endpoints in a tight loop, and bcrypt checking is intentionally an expensive operation.

This load could occasionally cause the appliance-frontend to disconnect from the backend. We observed the frontend's nginx reporting an upstream connection close, and curl requests to the backend from a shell exec'd into the frontend pod regularly hung.

I collected two CPU profiles from the appliance backend: one without this commit, one with it. In both cases, SG was not being installed - the frontend was running, and I had a browser tab open, so that the browser was hitting the backend frequently via the nginx API proxy.

Without this fix: [CPU profile image]

With this fix: [CPU profile image]

See the test plan below for how I obtained these CPU profiles.

Two things stand out between them. First, without this fix, the total CPU time consumed over the 30-second profiling period is thousands of times larger. On my Mac (so not even contending with other processes on a Kubernetes node), the backend used 25 seconds of CPU time, almost saturating a core. Second, calls to bcrypt.CompareHashAndPassword() are responsible for all of this.

It's perhaps not ideal from a security perspective to memory-cache the password, but subjectively this trade-off seems like a reasonable way to get moving. Let me know what you think though.
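As an illustration of the trade-off, here is a minimal sketch of this kind of cache. It is not the code in this PR: authCache, newAuthCache, Check and the injected verify function are hypothetical names, and keeping a SHA-256 digest of the last accepted password (rather than the plaintext) is an assumption made for the example. In the appliance, verify would be the place where bcrypt.CompareHashAndPassword() gets called.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
	"sync"
)

// authCache remembers a digest of the last password that passed the
// expensive check, so repeated requests with the same password skip it.
// All names here are hypothetical and exist only for this sketch.
type authCache struct {
	mu     sync.RWMutex
	digest [sha256.Size]byte
	valid  bool
	// verify is the expensive check (assumed to wrap a bcrypt comparison).
	verify func(password string) bool
}

func newAuthCache(verify func(string) bool) *authCache {
	return &authCache{verify: verify}
}

// Check returns true if the password is valid. Only cache misses reach
// the expensive verify function; wrong passwords are never cached.
func (c *authCache) Check(password string) bool {
	sum := sha256.Sum256([]byte(password))

	// Fast path: compare against the cached digest in constant time.
	c.mu.RLock()
	hit := c.valid && subtle.ConstantTimeCompare(sum[:], c.digest[:]) == 1
	c.mu.RUnlock()
	if hit {
		return true
	}

	// Slow path: run the expensive check and cache only on success.
	if !c.verify(password) {
		return false
	}
	c.mu.Lock()
	c.digest, c.valid = sum, true
	c.mu.Unlock()
	return true
}

func main() {
	slowChecks := 0
	cache := newAuthCache(func(p string) bool {
		slowChecks++ // stands in for the cost of a bcrypt comparison
		return p == "hunter2"
	})

	// Simulate a frontend polling an auth-gated endpoint in a tight loop.
	for i := 0; i < 1000; i++ {
		cache.Check("hunter2")
	}
	fmt.Println("slow checks:", slowChecks)
}
```

In this sketch, invalid passwords always take the slow path, so the cache does not make brute-forcing any cheaper than it was before. A real implementation would also need to clear the cached digest whenever the admin password changes.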

This is a necessary step for https://linear.app/sourcegraph/issue/REL-308/appliance-frontend-seems-to-disconnect-the-backend-during-installation but does not close it. The disconnection bug still occurs after clicking wait-for-admin, but I think this instance of it has a different cause. See https://github.com/sourcegraph/sourcegraph/pull/64216 for an explanation of that cause and a fix for it.

Test plan

Starting on the https://github.com/sourcegraph/sourcegraph/pull/64211 branch, not this one:

In one terminal:

export APPLIANCE_PPROF_ADDR=localhost:6061
go run ./cmd/appliance

In another:

cd internal/appliance/frontend/maintenance
pnpm run dev

Navigate to localhost:8889 in a web browser and log into the appliance. You don't need to begin installing SG, just leave the tab open.

In another terminal:

go tool pprof -png -output appliance-cpu-main.png 'http://localhost:6061/debug/pprof/profile?seconds=30'

Repeat the experiment for this branch (but with https://github.com/sourcegraph/sourcegraph/pull/64211 merged into it, for pprof), and compare profiles.

Finally, I deployed this branch to my local minikube environment to see how it interacted with the ingress stack:

eval $(minikube -p minikube docker-env)
docker rmi appliance:candidate; sg bazel run //cmd/appliance:image_tarball
docker tag appliance:candidate index.docker.io/sourcegraph/appliance:candidate

# cd to the helm chart repo
helm upgrade --install --namespace test \
  --set noResourceRestrictions=true --set image.tag=candidate --set selfUpdate.enabled=false \
  appliance ./charts/sourcegraph-appliance

I saw no disconnections during SG's installation, until the race condition I described further up kicked in after clicking wait-for-admin, and I had to refresh the page in order to see site-admin.

Changelog

Backport 156aa5a from #64213

(cherry picked from commit 156aa5a)

sourcegraph-release-bot requested review from a team and craigfurman August 1, 2024 16:42

sourcegraph-release-bot added cla-signed backports backported-to-5.5.x labels Aug 1, 2024

Chickensoupwithrice approved these changes Aug 1, 2024

craigfurman enabled auto-merge (squash) August 1, 2024 16:52

craigfurman merged commit 17871a4 into 5.5.x Aug 1, 2024

craigfurman deleted the backport-64213-to-5.5.x branch August 1, 2024 16:57
