Add a startup probe to codeintel-db#4136
Conversation
| exec: | ||
| command: | ||
| - /liveness.sh | ||
| startupProbe: |
There was a problem hiding this comment.
Should we be doing this for both databases?
There was a problem hiding this comment.
Added to frontend, too ✅
There was a problem hiding this comment.
Would it be possible to ensure that the database is actually starting up and not unresponsive via pg_isready?
Basically in the startup condition we want:
- a small window where postgres isn't bound to the socket
- a larger window where postgres is actively not accepting connections ("server is starting" error from clients) but replaying the WAL
There was a problem hiding this comment.
Would that help? Whether the startup probe fails because Postgres hasn't bound to the socket yet or it's rejecting connections, the probe would return a failure either way. K8s would continue probing until it uses up all of the failureThreshold retries.
The only way I can see any benefit from pg_isready is if it's use is split across probes:
- Startup probe (short timeout?): succeeds when Postgres has bound to the socket (but might not be accepting connections yet)
- Liveness probe (long timeout?): succeeds when Postgres is accepting connections
Even if we did that, the only potential benefit I see is in the case where Postgres is failing to bind to the socket due to some bad temporary state in the container that would get wiped upon restart. K8s would restart the container and Postgres would become ready after the shorter timeout rather than the longer timeout. That case doesn't seem very likely.
There was a problem hiding this comment.
My idea as code!
What you have now:
for i = 0; i < 360 && !started(); i++ {
sleep 10s
}
What pg_isready would allow:
for i = 0; i < 360 && !started() && starting(); i++ {
sleep 10s
}
The difference is that the first one will continue to loop on more critical conditions besides the database starting up. If the database doesn't start at all it might take an hour to restart.
There was a problem hiding this comment.
K8s would continue probing until it uses up all of the failureThreshold retries.
Ooh I see now. We can't say "hard exit" from the startup probe.
There was a problem hiding this comment.
Won't the first one continue to loop on fewer conditions? The second one has the additional condition && starting(). 🤔
There was a problem hiding this comment.
Ideally we'd be able to do the following but I'm not sure k8s probes are enough to accomplish it:
- if we're not bound or starting up (after like 10s) then exit HARD because postgres isn't doing anything
- if we're starting up keep polling
- if we're ready succeed the probe and go ready
Not sure we can distinguish the last two. Failures are failures (unless there's some exit code shenanigans we can do, like exit 256 sends a different signal to k8s).
There was a problem hiding this comment.
I don't think probes can hard exit https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-startup-probes
efritz
left a comment
There was a problem hiding this comment.
LGTM from what I understand of probes
|
Yay! |
Prior to this change, it was easy for
codeintel-dbto enter an infinite kill+restart loop in the event that Postgres had to recover database state from an OOM death. A customer ran into a situation where Postgres took ~5 minutes to recover with 11GB Rockskip tables.After this change, Postgres will be given 1 hour to start. Is that enough time, or too long? It's a balance between:
Ideally, the startup probe would know if Postgres is making progress, but I don't know how to easily teach it and for the check to be reliable.Edit: I don't think it would matter because all we can tell k8s is whether or not the startup probe succeeded, not alter thefailureThresholdand bail early.Checklist
Test plan
Details
```yaml apiVersion: apps/v1 kind: Deployment metadata: name: depl spec: replicas: 1 selector: matchLabels: component: web strategy: type: Recreate template: metadata: labels: component: web spec: terminationGracePeriodSeconds: 0 containers: - name: foo image: python command: ["bash"] args: ["-c", "trap SIGTERM exit; while true; do echo >> /log; wc -c /log; sleep 1; done"] startupProbe: exec: command: - "python3" - "-c" - "from pathlib import Path; import sys; sys.exit(1 if len(Path('/log').read_bytes()) < 5 else 0)" failureThreshold: 10 periodSeconds: 1 livenessProbe: exec: command: - "false" failureThreshold: 3 periodSeconds: 1 ```kubectl apply -f deployment.yaml