rpc: strengthen the behavior of CockroachDB in "fail slow" TCP connect timeouts #53410

@knz

The RPC layer relies on the network reporting when a TCP connection fails, to decide that the other side is unreachable.

For example, this logic is used to determine when connectivity to a node is lost, or to skip over a replica while discovering a leaseholder.

TLDR: Today, the CockroachDB logic is optimized for the "fail fast" scenario, and we have virtually no testing for "fail slow" scenarios.

Background

Depending on the network configuration, connections can fail in two ways:

  • they can fail fast, with a TCP RST sent immediately in response to the TCP SYN.

    This results in the well-known "Connection refused" error. This is the default network configuration in most OSes when the target IP address is valid, but there is no service listening on the desired port.

    (They can also fail fast if there is a non-crdb network service at the remote address, in which case the TLS handshake fails quickly.)

  • however, they will fail slowly if there is no host at the target IP address, or if a firewall rule is configured to DROP traffic to the target address/port pair.

    This results in a TCP handshake that lingers for multiple seconds, while the client network stack waits for a reply to its SYN packets.

Today, the CockroachDB logic is optimized for the "fail fast" scenario, and we have virtually no testing for "fail slow" scenarios.

"Fail slow" in practice

In practice, we have anecdotal reports of customers/users who see performance blips and transient cluster unavailability when they hit a "fail slow" situation.

The reason for this is that CockroachDB internally uses a timeout to detect connection errors. The timeout is set to multiple seconds because a small timeout would create spurious errors during legitimate network blips, which are common in cloud environments.

"Fail slow" situations arise "naturally", for example in the following circumstances:

  • the operator installs a firewall and mistakenly sets it to filter node-node traffic.
  • the operator moves a CockroachDB node to a new IP address, with no host server left at the old IP address.
  • the k8s orchestration configuration is changed to use a new network prefix.

Strategy

  • We should document this difference in behavior and invite operators to actively set up their network to achieve "fail fast" in the common case.

    In particular, a callout should be added to the docs on migrating a node to a new machine: keep a server listening at the previous IP address until the cluster learns of the new topology.

  • We should add testing for "fail slow" scenarios, and inventory the cases where CockroachDB currently misbehaves.

  • We should document the particular symptoms of encountering this issue.

gz#8203

gz#8949

Epic: CRDB-8500

Jira issue: CRDB-3869

Metadata

Labels

  • A-cc-enablement: Pertains to current CC production issues or short-term projects
  • A-kv-22.1-networking: (Temporary) label for work in scope for 22.1 in the KV/REPL team.
  • A-kv-server: Relating to the KV-level RPC server
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • C-performance: Perf of queries or internals. Solution not expected to change functional behavior.
  • T-server-and-security: DB Server & Security
  • X-server-triaged-202105
  • docs-done
  • docs-known-limitation
