kvserver: handle stalled or unresponsive replicas #104262

@erikgrinaker

Description

Replicas may stall indefinitely, i.e. become unresponsive. The typical examples are:

  • Disk stalls.
  • Log stalls.
  • Deadlocks.
  • Blocked syscalls (e.g. network hardware failure or kernel bug).

During a stall, all replica processing (both reads and writes) will stall, because callers are unable to acquire various replica mutexes (e.g. Replica.mu or Replica.readOnlyCmdMu).

In the general case, this can cause permanent range unavailability (see e.g. the failover/non-system/deadlock unavailability benchmark). The exception is disk stalls, which have dedicated handling both in Pebble and CockroachDB, self-terminating the node after 20 seconds and resolving the stall.

A stalled replica may or may not lose its lease:

  • Epoch lease: if the node can send a heartbeat RPC to the liveness range leaseholder, and can perform a no-op write on all local disks, the replica retains its lease.

  • Expiration lease: the replica will lose its lease within 6 seconds, since it can't replicate the lease extension.

As long as the replica retains its lease, the range will be unavailable.

However, even if the replica loses its lease and it is acquired elsewhere, this can still cause widespread unavailability. Any request that arrived while the replica still held its lease will hang forever (or until the client times out), blocked on mutexes, latches, syscalls, etc. Stale DistSender caches may also direct requests to the old, stalled leaseholder, where they get stuck. With a bounded worker pool, e.g. a connection pool, there is a chance that many or all workers will stall on the replica (especially if several replicas stall), starving out clients and, in the worst case, causing a complete workload stall.

There are two main approaches we can take here:

  1. Kill the node. This is the approach taken both by the Pebble disk stall detector and by other databases (e.g. FoundationDB). It is simple, effective, and reliable, but it affects all processing on the node (including otherwise healthy replicas and stores), and can cause quorum loss with correlated stalls across multiple nodes (e.g. network disk outages). Pebble also defaults to a fairly conservative 20-second threshold, to avoid false positives, which many users consider too high.

  2. Cancel all requests. This is harder, because we have to keep track of the contexts of all in-flight requests (this had a non-negligible latency cost last time we prototyped this), and the requests have to actually be responsive to context cancellation (they're not if they're waiting for a mutex or syscall). But if we can do this, it would limit the impact to the stalled replicas.

There is also the question of how to detect stalls. Pebble does this by timing individual syscalls. At the replica level, we could e.g. have a replica mutex watchdog, or monitor latency in the Raft scheduler and during request processing.

Jira issue: CRDB-28435

Epic: CRDB-25199

Labels

  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception).
  • O-postmortem: Originated from a Postmortem action item.
  • O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs.
  • P-3: Issues/test failures with no fix SLA.
