storage: thought experiment: productionize kv.raft_log.synchronize==false #19784
CockroachDB + !kv.raft_log.synchronize = ?
I had to think about what would happen if we didn't sync to disk in Raft the
other day, and wanted to write down my thoughts to see if there's anything
obviously wrong with them (which I usually expect there is). cc @bdarnell
Background
Performing synchronous disk writes in the critical path is expensive. It is easy
to understand why it is required in a monolithic datastore: once a write is
acknowledged, it must not vanish (the D in ACID).
We're a distributed system that replicates entries to a majority of nodes before
acknowledging them to the client, and so syncing to disk is less obviously
non-negotiable.
The enemy in this scenario is that by, say, power-cycling a subset of nodes, we
violate promises made, the most prominent of which are Raft-level promises such
as "I have this entry in my log" and "I won't forget that I already voted in
this term". Naively, not syncing would easily allow a node to break those after
returning from a crash.
But nodes could spontaneously combust at any time, too, and we can recover from
a certain number of overlapping fires (as long as quorums remain live). So what
if we treated any scenario in which we may have lost unsynched writes to disk as
a permanent node failure?
Suggested changes (unoptimized)
Consider making the following changes:
- When a node crashes, it has to re-join the cluster as a new (empty) node
  afterwards. Concretely:
  - early in the start sequence, check for a `DirtyShutdown` marker. If found,
    wipe the data dir and restart as a fresh node.
  - if no marker is found, write the `DirtyShutdown` marker and continue booting.
  - on clean shutdown, as the very last action, sync and remove the marker.
- Run without ever having to wait for a disk sync.
This mechanism keeps all the promises: if something is "sent to" disk (whether
it makes it there or not), then either it is still there when the node comes
back (so the promises are kept), or the node as such never comes back (the
machine returns, but under a new `NodeID` and not burdened by any of its prior
promises).
Drawbacks
If there weren't any drawbacks, we'd obviously be doing this already. The reason
we don't is that this makes a lot of failures that weren't previously fatal
catastrophic.
There are principally two types of failures, namely temporary and permanent
ones. A temporary one is the kind in which a node goes down, but keeps its
state: it can come back and participate again. A permanent one implies the loss
of previously acknowledged information, i.e. losing the whole hard drive
containing the log, forgetting a few entries in the log, or forgetting who a
vote was cast for.
In other words, a temporary failure could be modeled as a network outage or just
a long CPU freeze. A permanent failure is the irreversible loss of a replica.
Running without syncing before making promises implies that every unclean
shutdown has to be treated as a permanent failure.
Permanent failures are expensive. A new replica has to be added to the group; a
snapshot has to be transferred; a config change has to be carried out and new
meta entries are written. This takes time, and the cluster is vulnerable while
it is being repaired.
As the name suggests, permanent failures are also more dangerous. A temporary
outage of the whole cluster may be annoying, but no data is lost. On the other
hand, a permanent outage of a quorum of nodes (say two out of three) is
catastrophic, requires manual recovery, and implies having to live with data
loss.
There's little we can do about the latter except provide appropriate damage
control (for example, syncing at regular intervals), so the bottom line is
that if these failures are expected to be handled gracefully, not syncing is
out of the question.
But assuming that risk is to be taken (say the cluster is in three availability
zones with decent UPS coverage and adequately managed enterprise-grade SSDs),
we can at least shorten the recovery window:
Optimizations
- Instead of wiping the data directory and starting as a fresh node, the data
  is kept but all replicas are transformed into preemptive snapshots (i.e.
  their `ReplicaID` is wiped). The node still gets a new `NodeID` (and
  `StoreID`s) but allows the other members of its Ranges' Raft groups to
  associate these new IDs with the old ones, allowing them to perform
  configuration changes that add Replicas on the previous stores (under their
  new IDs).
- This avoids having to move any data in the common case, but still incurs the
  cost of carrying out a lot of configuration changes.
- To address this, we could
  - track which replicas are "dirty" and which ones aren't. A replica is
    "dirty" if it has potentially made a promise that hasn't been synced. A
    replica becoming "dirty" has to sync once (to persist that fact) and,
    assuming it hasn't made promises in a sufficiently long time (on the order
    of seconds), wipes its dirty flag (which does not have to sync). This
    essentially means that the dirty replicas are those that see
    regular-enough write traffic. We drop the requirement that new IDs are
    assigned to the node as it starts after a crash, and instead it marks all
    of its dirty replicas as `needs-readd` (so that they are re-added with a
    new `ReplicaID`, avoiding the preemptive snapshot as before).
  - go even further and, instead of going to `needs-readd` right away, try to
    recover any promises made from the remainder of the nodes for some time,
    i.e. participate in the Raft group passively, as a "black hole replica":
    - don't serve any client traffic.
    - receive and process `MsgApp`, but don't acknowledge anything (Raft would
      have to learn that this is a thing. Perhaps similar to non-voting
      replicas?).
    - propose a dummy command to the Replica. If we see it committed in the
      incoming Raft log (via the `MsgApp` stream), we know that we have all
      the log entries we previously acknowledged, and so that promise has been
      recovered.
    - collect the maximum term across all (not only a quorum of!) peers. Make
      sure to never vote unless the requested term is higher than the maximum
      seen. This recovers any previous vote cast.
    - try corresponding strategies for any other promises made, until all
      promises are recovered or we give up (in particular, give up if there's
      a config change).
- Assuming the resulting amount of meta range writes is still too high, we
  could try to add enough indirection to store the `ReplicaID` directly on the
  corresponding range (or hashed across the cluster). For example, roughly
  changing the stored `ReplicaDescriptor` to

  ```go
  type ReplicaDescriptor struct {
      PersistentNodeIdentifier  int64 // survives the permanent-failure crash
      PersistentStoreIdentifier int64 // ditto
  }
  ```

  and introducing a second translation table which maps these identifiers to
  their incarnation's `NodeID` and `StoreID`s and, most importantly, previous
  `ReplicaID`s. The optimization would be that the `ReplicaID` could be located
  on the respective range it applies to, and not in the central meta tables,
  which would otherwise become a hotspot for potentially millions of
  configuration changes. There are many problems with this approach, but it
  could be possible. For now it's just a quarter-baked idea, though.
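To make the indirection concrete, here is a toy version of that second translation table. All names are invented for illustration; the point is that an unclean restart rewrites one row per store, rather than one `ReplicaDescriptor` per range in the meta tables:

```go
package main

import "fmt"

// Incarnation is a hypothetical row of the translation table: the current
// identity behind a persistent node identifier.
type Incarnation struct {
	NodeID    int32
	StoreID   int32
	ReplicaID int32 // previous ReplicaID, the piece kept off the meta ranges
}

func main() {
	// persistent identifier -> current incarnation
	table := map[int64]Incarnation{
		7: {NodeID: 1, StoreID: 1, ReplicaID: 12},
	}

	// After an unclean restart, the machine keeps its persistent identifier
	// but re-registers under a fresh NodeID/StoreID; only this row changes,
	// while every ReplicaDescriptor pointing at identifier 7 stays untouched.
	table[7] = Incarnation{NodeID: 4, StoreID: 5, ReplicaID: 13}
	fmt.Println(table[7].NodeID, table[7].ReplicaID) // 4 13
}
```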
- Running with much larger ranges. This is interesting for a variety of
  reasons, the relevant one here being that the number of required
  configuration changes scales like `cluster_data_size / range_size`.
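To put rough numbers on that scaling (the cluster and range sizes here are made up for illustration, not taken from the issue):

```go
package main

import "fmt"

func main() {
	// A hypothetical 10 TiB cluster: every range that held dirty replicas
	// needs roughly one configuration change after an unclean restart.
	const clusterData = int64(10) << 40 // 10 TiB
	for _, rangeSize := range []int64{64 << 20, 1 << 30} {
		fmt.Printf("range_size=%d MiB -> ~%d config changes\n",
			rangeSize>>20, clusterData/rangeSize)
	}
	// range_size=64 MiB   -> ~163840 config changes
	// range_size=1024 MiB -> ~10240 config changes
}
```

Going from 64 MiB to 1 GiB ranges cuts the recovery work by 16x, which is why larger ranges matter here.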
- It could also be interesting to make the decision to sync/not sync not a
  global one, but configurable on, say, a per-zone basis.