CockroachDB: Health checks time out under load #44832
Description
When a CockroachDB node is near 100% CPU usage, requests to any of the health check endpoints (/health, /health?ready=1, or /_admin/v1/health) will sometimes hang. In our example Kubernetes manifests, the health checks have a timeout of 1 second, but I have observed the endpoint fail to respond for 20+ seconds. The node is otherwise still up and able to process SQL queries.
This is a major problem when using an HTTP request to one of these endpoints as a Kubernetes liveness probe. When Kubernetes sees multiple consecutive liveness probe failures, it assumes the CockroachDB container has crashed and restarts it. This results in downtime on a single-node cluster. On a three-node cluster, all three liveness probes can fail within a short window, also resulting in downtime.
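For reference, the kind of probe in question looks roughly like this (a sketch, not the exact manifest; the threshold values are illustrative, and 8080 is CockroachDB's default HTTP port):

```yaml
# Sketch of the liveness probe described above (values illustrative).
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  timeoutSeconds: 1      # the 1-second timeout mentioned above
  failureThreshold: 3    # after this many consecutive failures, kubelet restarts the container
```

With a 1-second timeout, a /health endpoint that hangs for 20+ seconds under load will blow through the failure threshold and trigger a restart even though the node is still serving SQL.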
Up until now, we have been using HTTP requests to the /health endpoint as our liveness probe. This is the behaviour specified within our example Kubernetes manifests.
To reproduce this behaviour, run cockroach workload init tpcc --warehouses=100 against a single-node GCP n1-standard-2 cluster (exact number of warehouses for a perfect repro TBD). On a Kubernetes cluster, our default liveness probe will fail, resulting in the CockroachDB container being restarted and the workload init command failing. If instead you change the liveness probe to a TCP check against either the HTTP or gRPC port, or remove the liveness probe entirely, the probe will not fail and the workload init command will succeed. Outside of a Kubernetes environment, you can also reproduce this issue by repeatedly sending HTTP requests to a health check endpoint and measuring the response time.
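The TCP-check workaround mentioned above can be sketched like this (assuming the default CockroachDB ports, 8080 for HTTP and 26257 for gRPC/SQL):

```yaml
# A TCP liveness check only verifies that the port accepts connections,
# so it keeps passing even when the HTTP handlers are starved for CPU.
livenessProbe:
  tcpSocket:
    port: 26257   # or 8080 for the HTTP port
  timeoutSeconds: 1
```

This trades sensitivity for robustness: the probe no longer exercises the HTTP stack at all, but it also no longer restarts a healthy node that is merely busy.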
Ideally, what we would like from CockroachDB are two health check endpoints:
- One for use as the liveness probe, which always responds healthy unless the application needs to be restarted.
- One for use as the readiness probe, which responds healthy if and only if that CRDB pod is able to receive requests. We currently use /health?ready=1 as our readiness probe endpoint. It will also time out under load, but that is less severe than a liveness probe timing out: Kubernetes will stop routing traffic for the CockroachDB service to any pod whose readiness probe is failing, until the probe recovers.
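For concreteness, the readiness probe we use looks roughly like this (a sketch; exact timings vary by manifest, and 8080 is the default HTTP port):

```yaml
readinessProbe:
  httpGet:
    path: /health?ready=1
    port: 8080
  periodSeconds: 5
  failureThreshold: 2   # pod is removed from Service endpoints while failing
```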
In the meantime, we have removed the liveness probes from our Kubernetes deployments of CockroachDB, which has helped keep CockroachDB running under load (Kubernetes will still restart the CockroachDB container if CockroachDB actually crashes). We should consider recommending this to other customers deploying CockroachDB on Kubernetes.
Epic: CRDB-549
Jira issue: CRDB-5200