Skip to content
This repository was archived by the owner on Jun 9, 2025. It is now read-only.

Add a startup probe to codeintel-db#4136

Merged
chrismwendt merged 2 commits into
masterfrom
lenient-codeintel-db-probes
Jun 1, 2022
Merged

Add a startup probe to codeintel-db#4136
chrismwendt merged 2 commits into
masterfrom
lenient-codeintel-db-probes

Conversation

@chrismwendt

@chrismwendt chrismwendt commented Jun 1, 2022

Copy link
Copy Markdown
Contributor

Prior to this change, it was easy for codeintel-db to enter an infinite kill+restart loop in the event that Postgres had to recover database state from an OOM death. A customer ran into a situation where Postgres took ~5 minutes to recover with 11GB Rockskip tables.

After this change, Postgres will be given 1 hour to start. Is that enough time, or too long? It's a balance between:

  • Reducing the likelihood customers get stuck and require manual intervention when Postgres is recovering with large tables (we have concrete evidence this happens, see above)
  • Restarting Postgres quickly when there's some bad pod state lying around and a restart would fix it (hypothetical, seems relatively unlikely)

Ideally, the startup probe would know if Postgres is making progress, but I don't know how to easily teach it and for the check to be reliable. Edit: I don't think it would matter because all we can tell k8s is whether or not the startup probe succeeded, not alter the failureThreshold and bail early.

Checklist

Test plan

Details ```yaml apiVersion: apps/v1 kind: Deployment metadata: name: depl spec: replicas: 1 selector: matchLabels: component: web strategy: type: Recreate template: metadata: labels: component: web spec: terminationGracePeriodSeconds: 0 containers: - name: foo image: python command: ["bash"] args: ["-c", "trap SIGTERM exit; while true; do echo >> /log; wc -c /log; sleep 1; done"] startupProbe: exec: command: - "python3" - "-c" - "from pathlib import Path; import sys; sys.exit(1 if len(Path('/log').read_bytes()) < 5 else 0)" failureThreshold: 10 periodSeconds: 1 livenessProbe: exec: command: - "false" failureThreshold: 3 periodSeconds: 1 ```
  • kubectl apply -f deployment.yaml
  • Wait 5s, startup probe succeeds, container is up
  • Wait 3s, liveness probe fails, k8s kills container

@chrismwendt chrismwendt requested a review from efritz June 1, 2022 04:52
exec:
command:
- /liveness.sh
startupProbe:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be doing this for both databases?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frontend?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to frontend, too ✅

@efritz efritz Jun 1, 2022

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to ensure that the database is actually starting up and not unresponsive via pg_isready?

Basically in the startup condition we want:

  • a small window where postgres isn't bound to the socket
  • a larger window where postgres is actively not accepting connections ("server is starting" error from clients) but replaying the WAL

@chrismwendt chrismwendt Jun 1, 2022

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that help? Whether the startup probe fails because Postgres hasn't bound to the socket yet or it's rejecting connections, the probe would return a failure either way. K8s would continue probing until it uses up all of the failureThreshold retries.

The only way I can see any benefit from pg_isready is if it's use is split across probes:

  • Startup probe (short timeout?): succeeds when Postgres has bound to the socket (but might not be accepting connections yet)
  • Liveness probe (long timeout?): succeeds when Postgres is accepting connections

Even if we did that, the only potential benefit I see is in the case where Postgres is failing to bind to the socket due to some bad temporary state in the container that would get wiped upon restart. K8s would restart the container and Postgres would become ready after the shorter timeout rather than the longer timeout. That case doesn't seem very likely.

@efritz efritz Jun 1, 2022

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My idea as code!

What you have now:

for i = 0; i < 360 && !started(); i++ {
     sleep 10s
}

What pg_isready would allow:

for i = 0; i < 360 && !started() && starting(); i++ {
     sleep 10s
}

The difference is that the first one will continue to loop on more critical conditions besides the database starting up. If the database doesn't start at all it might take an hour to restart.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

K8s would continue probing until it uses up all of the failureThreshold retries.

Ooh I see now. We can't say "hard exit" from the startup probe.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't the first one continue to loop on fewer conditions? The second one has the additional condition && starting(). 🤔

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we'd be able to do the following but I'm not sure k8s probes are enough to accomplish it:

  • if we're not bound or starting up (after like 10s) then exit HARD because postgres isn't doing anything
  • if we're starting up keep polling
  • if we're ready succeed the probe and go ready

Not sure we can distinguish the last two. Failures are failures (unless there's some exit code shenanigans we can do, like exit 256 sends a different signal to k8s).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efritz efritz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from what I understand of probes

@DaedalusG

Copy link
Copy Markdown
Contributor

Yay!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants