Make the Postgres healthchecks more lenient #824
  interval: 10s
  timeout: 1s
- retries: 10
+ retries: 360
If the health of a running container can fail for a full hour, is this healthcheck really even doing anything valuable? Perhaps modifying the start_period is a better option here, if this is intended to address the same slow startup issue as in sourcegraph/deploy-sourcegraph#4136.
https://docs.docker.com/engine/reference/builder/#healthcheck
start period provides initialization time for containers that need time to bootstrap. Probe failure during that period will not be counted towards the maximum number of retries. However, if a health check succeeds during the start period, the container is considered started and all consecutive failures will be counted towards the maximum number of retries.
An hour feels too long here (and on the Kubernetes startup probe), but I don't have any hard data to gauge typical recovery startup times. It might not make any difference - most of the non-OOM failures we see aren't recoverable from a restart (example: bad file system permissions).
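For illustration, the start_period suggestion above might look roughly like the following Compose fragment. This is only a sketch, not the change made in this PR: the service name `pgsql`, the `pg_isready` test command, and the `start_period` value are assumptions chosen for the example. Note that 360 retries at a 10s interval is 3600s, which is where the "full hour" figure comes from.

```yaml
services:
  pgsql:                      # hypothetical service name, for illustration only
    healthcheck:
      # Assumed probe command; the real deployment may use a different script.
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 1s
      # Keep the original, strict retry count so that a container which is
      # already running is still marked unhealthy after about 100 seconds.
      retries: 10
      # Failures during this window do not count towards `retries`.
      # The value is a placeholder; tune it to the observed recovery time.
      start_period: 1h
```

With this shape, a slow startup is tolerated for the length of the start period, while failures after the first successful probe are still caught within `interval` x `retries`.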
👍 switched to start_period.
  interval: 10s
  timeout: 1s
- retries: 10
+ retries: 360
Is it intentional that codeinsights-db isn't included in this change?
Updated codeinsights-db, too.
See sourcegraph/deploy-sourcegraph#4136
Checklist
Test plan
N/A