Add a startup probe to codeintel-db by chrismwendt · Pull Request #4136 · sourcegraph/deploy-sourcegraph

chrismwendt · 2022-06-01T04:52:37Z

Prior to this change, it was easy for codeintel-db to enter an infinite kill+restart loop in the event that Postgres had to recover database state from an OOM death. A customer ran into a situation where Postgres took ~5 minutes to recover with 11GB Rockskip tables.

After this change, Postgres will be given 1 hour to start. Is that enough time, or too long? It's a balance between:

Reducing the likelihood customers get stuck and require manual intervention when Postgres is recovering with large tables (we have concrete evidence this happens, see above)
Restarting Postgres quickly when there's some bad pod state lying around and a restart would fix it (hypothetical, seems relatively unlikely)

~~Ideally, the startup probe would know whether Postgres is making progress, but I don't know an easy way to teach it that while keeping the check reliable.~~ Edit: I don't think it would matter, because all we can tell k8s is whether or not the startup probe succeeded; we can't alter the failureThreshold or bail early.
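As a rough check on the "1 hour" figure, the startup budget is approximately failureThreshold × periodSeconds. The sketch below uses the values quoted in this PR's review thread (360 attempts, 10 seconds apart) and ignores initialDelaySeconds and probe timeouts; the function names are hypothetical and only illustrate the arithmetic.

```python
import math


def startup_budget_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Approximate time a container gets before the startup probe gives up."""
    return failure_threshold * period_seconds


def min_failure_threshold(recovery_seconds: float, period_seconds: int) -> int:
    """Smallest failureThreshold whose budget covers a given recovery time."""
    return math.ceil(recovery_seconds / period_seconds)


# 360 attempts, 10 seconds apart: the budget described in this PR.
print(startup_budget_seconds(360, 10))

# Threshold needed to cover a ~5 minute recovery at the same period.
print(min_failure_threshold(5 * 60, 10))
```

Under these assumptions the ~5 minute recovery reported by the customer fits well inside the budget, which is the trade-off the description weighs.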

Checklist

CHANGELOG.md updated https://github.com/sourcegraph/sourcegraph/pull/36408
K8s Upgrade notes updated https://github.com/sourcegraph/sourcegraph/pull/36408
Sister deploy-sourcegraph-docker change: Make the Postgres healthchecks more lenient deploy-sourcegraph-docker#824
All images have a valid tag and SHA256 sum

Test plan

Details

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: depl
spec:
  replicas: 1
  selector:
    matchLabels:
      component: web
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        component: web
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: foo
          image: python
          command: ["bash"]
          args: ["-c", "trap SIGTERM exit; while true; do echo >> /log; wc -c /log; sleep 1; done"]
          startupProbe:
            exec:
              command:
                - "python3"
                - "-c"
                - "from pathlib import Path; import sys; sys.exit(1 if len(Path('/log').read_bytes()) < 5 else 0)"
            failureThreshold: 10
            periodSeconds: 1
          livenessProbe:
            exec:
              command:
                - "false"
            failureThreshold: 3
            periodSeconds: 1
```

kubectl apply -f deployment.yaml
Wait 5s, startup probe succeeds, container is up
Wait 3s, liveness probe fails, k8s kills container
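The timeline in the test plan can be reproduced with a simplified model of how the two probes are sequenced. This is only a sketch, not the kubelet's implementation: `simulate_probes` and its parameters are hypothetical names, probes are assumed to run on an exact period with no jitter or timeouts, and the log is assumed to hold one byte per elapsed second.

```python
from typing import Callable, List, Tuple


def simulate_probes(
    startup_probe: Callable[[int], bool],
    liveness_probe: Callable[[int], bool],
    startup_failure_threshold: int,
    liveness_failure_threshold: int,
    period_seconds: int = 1,
    max_seconds: int = 60,
) -> List[Tuple[int, str]]:
    """Return (time, event) pairs for a single container lifetime."""
    events: List[Tuple[int, str]] = []
    started = False
    failures = 0  # consecutive failures of the probe currently in charge
    for t in range(period_seconds, max_seconds + 1, period_seconds):
        if not started:
            # The liveness probe is not consulted until startup succeeds.
            if startup_probe(t):
                started = True
                failures = 0
                events.append((t, "startup probe succeeded"))
            else:
                failures += 1
                if failures >= startup_failure_threshold:
                    events.append((t, "killed: startup probe exhausted"))
                    return events
        elif liveness_probe(t):
            failures = 0
        else:
            failures += 1
            if failures >= liveness_failure_threshold:
                events.append((t, "killed: liveness probe exhausted"))
                return events
    return events


# Mirror the test deployment: the startup probe passes once /log has
# at least 5 bytes, and the liveness probe ("false") always fails.
events = simulate_probes(
    startup_probe=lambda t: t >= 5,
    liveness_probe=lambda t: False,
    startup_failure_threshold=10,
    liveness_failure_threshold=3,
)
for when, what in events:
    print(f"t={when}s {what}")
```

In this model the startup probe succeeds at 5 seconds and the container is killed 3 seconds later, matching the two waits listed in the test plan.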

efritz · 2022-06-01T14:08:11Z

          exec:
            command:
              - /liveness.sh
+        startupProbe:


Should we be doing this for both databases?

Added to frontend, too ✅

Would it be possible to ensure that the database is actually starting up and not unresponsive via pg_isready?

Basically in the startup condition we want:

a small window where postgres isn't bound to the socket

a larger window where postgres is actively not accepting connections ("server is starting" error from clients) but replaying the WAL

Would that help? Whether the startup probe fails because Postgres hasn't bound to the socket yet or it's rejecting connections, the probe would return a failure either way. K8s would continue probing until it uses up all of the failureThreshold retries.

The only way I can see any benefit from pg_isready is if its use is split across probes:

Startup probe (short timeout?): succeeds when Postgres has bound to the socket (but might not be accepting connections yet)

Liveness probe (long timeout?): succeeds when Postgres is accepting connections

Even if we did that, the only potential benefit I see is in the case where Postgres is failing to bind to the socket due to some bad temporary state in the container that would get wiped upon restart. K8s would restart the container and Postgres would become ready after the shorter timeout rather than the longer timeout. That case doesn't seem very likely.

My idea as code!

What you have now:

for i = 0; i < 360 && !started(); i++ { sleep 10s }

What pg_isready would allow:

for i = 0; i < 360 && !started() && starting(); i++ { sleep 10s }

The difference is that the first one will continue to loop on more critical conditions besides the database starting up. If the database doesn't start at all it might take an hour to restart.
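The two loops above can be written as runnable code to compare how long each keeps waiting on a database that never comes up. This is a sketch in plain Python, outside of what a Kubernetes probe can express: `seconds_waited`, `started`, and `starting` are hypothetical names, and sleeping is replaced by a counter so the example finishes instantly.

```python
from typing import Callable, Optional


def seconds_waited(
    started: Callable[[], bool],
    starting: Optional[Callable[[], bool]] = None,
    max_attempts: int = 360,
    period_seconds: int = 10,
) -> int:
    """Total time spent waiting before the loop stops.

    With starting=None this is the first loop (wait only on started()).
    With a starting() check it is the second loop, which also stops as
    soon as the database is neither started nor starting.
    """
    waited = 0
    for _ in range(max_attempts):
        if started():
            break
        if starting is not None and not starting():
            break
        waited += period_seconds  # stands in for "sleep 10s"
    return waited


# A database that is down and not making any progress.
def never_started() -> bool:
    return False


def not_starting() -> bool:
    return False


print(seconds_waited(never_started))
print(seconds_waited(never_started, starting=not_starting))
```

Under these assumptions the first loop waits the full hour before giving up, while the second stops immediately because nothing is starting.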

K8s would continue probing until it uses up all of the failureThreshold retries.

Ooh I see now. We can't say "hard exit" from the startup probe.

Won't the first one continue to loop on fewer conditions? The second one has the additional condition && starting(). 🤔

Ideally we'd be able to do the following but I'm not sure k8s probes are enough to accomplish it:

if we're not bound or starting up (after like 10s) then exit HARD because postgres isn't doing anything

if we're starting up keep polling

if we're ready succeed the probe and go ready

Not sure we can distinguish the last two. Failures are failures (unless there's some exit code shenanigans we can do, like exit 256 sends a different signal to k8s).

I don't think probes can hard exit https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-startup-probes
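To illustrate why the three cases collapse into two, the sketch below maps `pg_isready` exit statuses to states and then to what an exec probe can report. The exit statuses (0 accepting connections, 1 rejecting connections, 2 no response, 3 no attempt made) are the ones documented for `pg_isready`; the `PgState` enum, the function names, and the default host and port are assumptions for illustration and are not part of this PR.

```python
import subprocess
from enum import Enum


class PgState(Enum):
    READY = "accepting connections"
    STARTING = "rejecting connections (for example, during startup)"
    NO_RESPONSE = "no response to the connection attempt"
    NO_ATTEMPT = "no attempt made (for example, invalid parameters)"


_EXIT_TO_STATE = {
    0: PgState.READY,
    1: PgState.STARTING,
    2: PgState.NO_RESPONSE,
    3: PgState.NO_ATTEMPT,
}


def classify(exit_code: int) -> PgState:
    """Translate a pg_isready exit status into a state."""
    try:
        return _EXIT_TO_STATE[exit_code]
    except KeyError:
        raise ValueError(f"unexpected pg_isready exit status: {exit_code}")


def probe_exit_status(state: PgState) -> int:
    """What an exec probe can tell k8s: 0 for success, non-zero for failure."""
    return 0 if state is PgState.READY else 1


def check_postgres(host: str = "localhost", port: int = 5432) -> PgState:
    """Run pg_isready (must be on PATH) and classify its result."""
    result = subprocess.run(
        ["pg_isready", "-h", host, "-p", str(port)],
        capture_output=True,
    )
    return classify(result.returncode)


# "Starting up" and "not responding at all" are different states...
print(classify(1), classify(2))
# ...but both reach k8s as the same probe failure.
print(probe_exit_status(classify(1)), probe_exit_status(classify(2)))
```

A probe script can tell the states apart internally, but since the probe result is only pass or fail, k8s keeps retrying until failureThreshold is used up in either failing case.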

efritz

LGTM from what I understand of probes

DaedalusG · 2022-06-02T23:03:42Z

Yay!

more lenient probes

5043ba5

chrismwendt requested a review from efritz June 1, 2022 04:52

efritz reviewed Jun 1, 2022

frontend, too

c1474a0

efritz approved these changes Jun 1, 2022

chrismwendt enabled auto-merge (squash) June 1, 2022 19:03

chrismwendt mentioned this pull request Jun 1, 2022

Add docs for new startup probes on Postgres containers sourcegraph/sourcegraph-public-snapshot#36408

Merged

chrismwendt merged commit 683e455 into master Jun 1, 2022

chrismwendt deleted the lenient-codeintel-db-probes branch June 1, 2022 19:22

chrismwendt mentioned this pull request Jun 1, 2022

Make the Postgres healthchecks more lenient sourcegraph/deploy-sourcegraph-docker#824

Merged

2 tasks

caugustus-sourcegraph mentioned this pull request Jun 13, 2022

Add db startup probes to helm sourcegraph/sourcegraph-public-snapshot#37176

Closed

eseliger added a commit to sourcegraph/deploy-sourcegraph-helm that referenced this pull request Jun 13, 2022

Backport fix from sourcegraph/deploy-sourcegraph#4136

6148259

caugustus-sourcegraph pushed a commit to sourcegraph/deploy-sourcegraph-helm that referenced this pull request Jun 13, 2022

Backport fix from sourcegraph/deploy-sourcegraph#4136 (#133)

a51e261

chrismwendt mentioned this pull request Jul 6, 2022

alerts: Do not alert on long transactions in codeintel-db sourcegraph/sourcegraph-public-snapshot#36619

Merged

chrismwendt merged 2 commits into master from lenient-codeintel-db-probes
