rpc: strengthen the behavior of CockroachDB in "fail slow" TCP connect timeouts #53410
Description
The RPC layer relies on the network reporting when a TCP connection fails, to decide that the other side is unreachable.
For example, this logic is used to determine when connectivity to a node is lost, or to skip over a replica while discovering a leaseholder.
TLDR: Today, the CockroachDB logic is optimized for the "fail fast" scenario, and we have virtually no testing for "fail slow" scenarios.
Background
Depending on the network configuration, connections can fail in two ways:
- They can fail fast, with a TCP RST sent immediately in response to the TCP SYN.
  This results in the well-known "Connection refused" error. This is the default network configuration in most OSes when the target IP address is valid, but there is no service listening on the desired port.
  (They can also fail fast if there is a non-crdb network service at the remote address, in which case the TLS handshake fails quickly.)
- However, they will fail slowly if there is no host at the target IP address, or if a firewall rule DROPs traffic to the target address/port pair.
  This results in a TCP handshake that lingers for multiple seconds, while the client network stack waits for a TCP packet in response to its SYN requests.
Today, the CockroachDB logic is optimized for the "fail fast" scenario, and we have virtually no testing for "fail slow" scenarios.
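To make the two failure modes concrete, here is a minimal Go sketch of how a dial error could be sorted into "fail fast" and "fail slow". `classifyDialError` and the two sample errors are hypothetical names introduced for illustration; this is not CockroachDB's RPC code. The sample errors are synthetic values shaped like what `net.Dial` typically returns on Unix-like systems, so real errors may differ in detail.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
)

// classifyDialError is a hypothetical helper (not CockroachDB code) that maps
// a dial error onto the failure modes described above.
func classifyDialError(err error) string {
	if err == nil {
		return "connected"
	}
	// A TCP RST in response to the SYN surfaces as ECONNREFUSED.
	if errors.Is(err, syscall.ECONNREFUSED) {
		return "fail-fast"
	}
	// No reply to the SYN surfaces as a timeout once the dial deadline expires.
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return "fail-slow"
	}
	// Anything else, e.g. a failed TLS handshake against a non-crdb service.
	return "other"
}

// Synthetic errors for illustration only (assumption: real dial errors wrap
// the same underlying causes).
var (
	sampleRefused = &net.OpError{Op: "dial", Net: "tcp",
		Err: os.NewSyscallError("connect", syscall.ECONNREFUSED)}
	sampleTimeout = &net.OpError{Op: "dial", Net: "tcp",
		Err: os.ErrDeadlineExceeded}
)

func main() {
	fmt.Println(classifyDialError(nil))           // connected
	fmt.Println(classifyDialError(sampleRefused)) // fail-fast
	fmt.Println(classifyDialError(sampleTimeout)) // fail-slow
}
```

The point of the sketch is that the "fail slow" branch can only be reached after the full dial timeout has elapsed, whereas the "fail fast" branch returns as soon as the RST arrives.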
"Fail slow" in practice
In practice, we have anecdotal reports of customers/users who see performance blips and transient cluster unavailability when they run into a "fail slow" situation.
The reason for this is that CockroachDB internally uses a timeout to detect connection errors. The timeout is set to multiple seconds because a small timeout would create spurious errors during legitimate network blips, which are common in cloud environments.
"Fail slow" situations arise "naturally", for example in the following circumstances:
- the operator installs a firewall and mistakenly sets it to filter node-node traffic.
- the operator moves a CockroachDB node to a new IP address, with no host server left at the old IP address.
- the k8s orchestration configuration is changed to use a new network prefix.
Strategy
- We should document this difference in behavior and invite operators to actively set up their network to achieve "fail fast" in the common case.
  In particular, a callout should be added to docs when migrating a node to a new machine, to keep a server listening at the previous IP address until the cluster learns of the new topology.
- We should add additional testing for "fail slow" scenarios, and inventory the cases where CockroachDB currently misbehaves.
- We should document the particular symptoms of encountering this issue.
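As one possible starting point for "fail slow" testing, the sketch below simulates a peer that never answers the SYN without needing a real firewall. It assumes the code under test accepts an injectable dial function; `dialFunc`, `blackholeDial`, and `timedDial` are hypothetical names, and the address is a documentation-range placeholder that is never actually dialed.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// dialFunc has the same shape as (*net.Dialer).DialContext, so a real dialer
// or a simulated one can be swapped in.
type dialFunc func(ctx context.Context, network, addr string) (net.Conn, error)

// blackholeDial simulates a "fail slow" peer: it never replies and only
// returns once the caller's context expires or is canceled.
func blackholeDial(ctx context.Context, network, addr string) (net.Conn, error) {
	<-ctx.Done()
	return nil, ctx.Err()
}

// timedDial runs dial under the given timeout and reports how long the
// caller stayed blocked.
func timedDial(dial dialFunc, addr string, timeout time.Duration) (time.Duration, error) {
	start := time.Now()
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	conn, err := dial(ctx, "tcp", addr)
	if conn != nil {
		conn.Close()
	}
	return time.Since(start), err
}

func main() {
	// 192.0.2.1 is a reserved documentation address; it is only a label here.
	elapsed, err := timedDial(blackholeDial, "192.0.2.1:26257", 200*time.Millisecond)
	fmt.Printf("blocked for %v, err: %v\n", elapsed, err)
}
```

A test built this way can assert that callers stay blocked for the whole timeout, which is the behavior that distinguishes "fail slow" from "fail fast", and then check how the surrounding logic copes with that delay.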
gz#8203
gz#8949
Epic: CRDB-8500
Jira issue: CRDB-3869