kvserver: allow committing entries not in leader's stable storage #88699

@tbg

Is your feature request related to a problem? Please describe.

The entire point of quorum replication is to provide high availability for
writes. As long as a quorum of voters can write to disk & communicate, forward
progress ought to be possible.

Unfortunately, in practice, etcd/raft forces the leader to append to its local
log before disseminating new log entries.

A leader with write-degraded storage thus renders the range inoperable to
varying degrees.[1] In other words, the blast radius is larger than necessary.

Generally, since users are sensitive even to short blips in latency, avoiding
the high tail latencies of synchronous writes is a good idea. More symmetry
between the leader's local append and the followers' appends helps with that.
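
To make the availability argument concrete, here is a minimal sketch of a commit rule that counts only durable copies and does not care which voter is the leader. The function name `quorumCommitIndex` and the slice layout are assumptions made for illustration; this is not the etcd/raft API.

```go
package main

import (
	"fmt"
	"sort"
)

// quorumCommitIndex returns the highest log index that is durable on a
// majority of voters. durable[i] is the last index voter i has synced to
// disk; which slot belongs to the leader does not matter.
// Hypothetical helper for illustration only.
func quorumCommitIndex(durable []uint64) uint64 {
	if len(durable) == 0 {
		return 0
	}
	// Copy before sorting so the caller's slice is left untouched.
	sorted := append([]uint64(nil), durable...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	quorum := len(sorted)/2 + 1
	// The quorum-th largest value is durable on at least quorum voters.
	return sorted[quorum-1]
}

func main() {
	// Voter 0 is a leader whose disk has stalled at index 5; both
	// followers have synced up to index 10.
	fmt.Println(quorumCommitIndex([]uint64{5, 10, 10})) // prints 10
}
```

Under this rule the two followers alone carry the commit index to 10, even though the leader's own log is only durable up to index 5.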

Describe the solution you'd like

Improve the interface of RawNode such that it decouples the operations that
move data to stable storage (voting, log appends, persisting the occasional
commit index) from the other operations (in particular, sending
messages). This requires or at least strongly suggests also doing #17500, since
a leader that cannot apply entries that were committed without being locally
durable will not release latches and thus continues to propagate the effects of
its poor disk health to foreground traffic.
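
The decoupling described above can be sketched roughly as follows. This is a toy model, not the RawNode interface: `entry`, `appendDone`, `storageWorker`, `replicate`, and `runDemo` are hypothetical names, followers are modeled as local workers, and a gate channel stands in for disk latency. The point it illustrates is that dispatch does not wait for the leader's own append, so a quorum of followers can acknowledge an entry first.

```go
package main

import "fmt"

// entry is a log entry; only the index matters for this sketch.
type entry struct{ index uint64 }

// appendDone is sent by a storage worker once entries up to index are
// durable on the named replica.
type appendDone struct {
	replica string
	index   uint64
}

// storageWorker makes entries durable off the main loop. Each append
// blocks on gate, which simulates waiting for an fsync.
func storageWorker(replica string, in <-chan entry, gate <-chan struct{}, acks chan<- appendDone) {
	for e := range in {
		<-gate
		acks <- appendDone{replica: replica, index: e.index}
	}
}

// replicate hands the entry to every replica without waiting for any of
// them, then returns the replicas whose acks formed the quorum.
func replicate(e entry, inputs map[string]chan entry, acks <-chan appendDone) []string {
	for _, in := range inputs {
		in <- e // buffered, so a slow disk does not block dispatch
	}
	quorum := len(inputs)/2 + 1
	var acked []string
	for len(acked) < quorum {
		if a := <-acks; a.index >= e.index {
			acked = append(acked, a.replica)
		}
	}
	return acked
}

// runDemo returns the replicas that formed the quorum and the replica
// whose ack arrived only after its disk was released.
func runDemo() ([]string, string) {
	acks := make(chan appendDone, 3)
	leaderGate := make(chan struct{}) // stalled until released below
	fastGate := make(chan struct{})
	close(fastGate) // receives on a closed channel return immediately

	gates := map[string]chan struct{}{"leader": leaderGate, "f1": fastGate, "f2": fastGate}
	inputs := map[string]chan entry{}
	for name, gate := range gates {
		in := make(chan entry, 1)
		inputs[name] = in
		go storageWorker(name, in, gate, acks)
	}

	quorumAcks := replicate(entry{index: 1}, inputs, acks)
	leaderGate <- struct{}{} // the leader's disk finally completes the write
	late := <-acks
	return quorumAcks, late.replica
}

func main() {
	quorumAcks, late := runDemo()
	fmt.Println("committed with acks from", quorumAcks)
	fmt.Println("late ack from", late)
}
```

In this model the entry reaches a quorum while the leader's append is still pending. As the paragraph above notes, the leader would still need to apply such entries (see #17500) for foreground traffic to benefit.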

Describe alternatives you've considered

We could more aggressively fail over away from leaders that are not performing
as expected. In a sense, this would be the more "sturdy" alternative and
possibly a better ROI if the main scenario we're trying to address is
persistent degradation.

Decoupling the various Ready tasks should noticeably smooth out kinks for which
a fail-over would be too heavy-handed a solution.
It also allows us to markedly improve performance in the steady state.

So we should really do both anyway.

Additional context

#88442
#88596
#17500

Jira issue: CRDB-19840

Footnotes

  1. until the disk degrades to the point where the leader fails to heartbeat
    followers at a sufficient frequency, at which point a new leader can be
    elected

Metadata

Labels: C-enhancement, T-kv (KV Team)