
storage: thought experiment: productionize kv.raft_log.synchronize==false #19784

@tbg

Description


CockroachDB + !kv.raft_log.synchronize = ?

I had to think about what would happen if we didn't sync to disk in Raft the
other day, and wanted to write down my thoughts to see if there's anything
obviously wrong with them (which I usually expect there is). cc @bdarnell

Background

Performing synchronous disk writes in the critical path is expensive. It is easy
to understand why it is required in a monolithic datastore: once a write is
acknowledged, it must not vanish (the D in ACID).

We're a distributed system that replicates entries to a majority of nodes before
acknowledging them to the client, and so syncing to disk is less obviously
non-negotiable.

The enemy in this scenario is that by, say, power-cycling a subset of nodes, we
violate promises made, the most prominent of which are Raft-level promises such
as "I have this entry in my log" and "I won't forget that I already voted in
this term". Naively, not syncing would easily allow a node to break those after
returning from a crash.

But nodes could spontaneously combust at any time, too, and we can recover from
a certain number of overlapping fires (as long as quorums remain live). So what
if we treated any scenario in which we may have lost unsynched writes to disk as
a permanent node failure?

Suggested changes (unoptimized)

Consider making the following changes:

  • When a node crashes, it has to re-join the cluster as a new (empty) node
    after. As in,
    • early in the start sequence, check for a DirtyShutdown marker. If found,
      wipe the data dir and restart as fresh node.
    • if no marker found, write the DirtyShutdown marker and continue booting.
    • on clean shutdown, as the very last action, sync and remove the marker.
  • Run without ever having to wait for a disk sync.

This mechanism keeps all the promises: if something is "sent to" disk (whether
it makes it there or not), then either it is still there when the node comes
back (so the promises were kept) or the node effectively never comes back (it
restarts, but under a new NodeID, unburdened by any of its prior promises).

Drawbacks

If there weren't any drawbacks, we'd obviously be doing this already. The reason
we don't is that this makes a lot of failures that weren't previously fatal
catastrophic.

There are principally two types of failures, namely temporary and permanent
ones. A temporary one is the kind in which a node goes down, but keeps its
state: it can come back and participate again. A permanent one implies the loss
of previously acknowledged information: losing the whole hard drive containing
the log, forgetting a few entries in the log, or forgetting who a vote was cast
for.

In other words, a temporary failure could be modeled as a network outage or just
a long CPU freeze. A permanent failure is the irreversible loss of a replica.

Running without syncing before making promises implies that all unclean
shutdowns have to be treated as a permanent failure.

Permanent failures are expensive. A new replica has to be added to the group; a
snapshot has to be transferred; a config change has to be carried out and new
meta entries are written. This takes time, and the cluster is vulnerable while
it is being repaired.

As the name suggests, permanent failures are also more dangerous. A temporary
outage of the whole cluster may be annoying, but no data is lost. On the other
hand, a permanent outage of a quorum of nodes (say two out of three) is
catastrophic, requires manual recovery, and implies having to live with data
loss.

There's little we can do about the latter except provide appropriate damage
control (for example, syncing in regular intervals), so the bottom line is
that if these failures are expected to be handled gracefully, not syncing is
out of the question.

But assuming that risk is to be taken (say the cluster is in three availability
zones with a decent UPS and adequately managed enterprise-grade SSDs), we can at
least optimize the recovery window:

Optimizations

  • Instead of wiping the data directory and starting as a fresh node, the data is
    kept but all replicas are transformed into preemptive snapshots (i.e. their
    ReplicaID is wiped). The node still gets a new NodeID (and StoreIDs), but
    allows the other members of its Ranges' Raft groups to associate these new IDs
    with the old ones, allowing them to perform configuration changes that add
    Replicas on the previous stores (under their new IDs).

  • This avoids having to move any data in the common case, but still incurs the
    cost of carrying out a lot of configuration changes.

  • To address this, we could

    • track which replicas are "dirty" and which ones aren't. A replica is
      "dirty" if it has potentially made a promise that hasn't been synched. A
      replica becoming "dirty" has to sync once (to persist that fact) and,
      once it hasn't made promises for a sufficiently long time (on the order
      of seconds), wipes its dirty flag (which does not have to sync). This
      essentially means that the dirty replicas are those that see
      regular-enough write traffic. We drop the requirement that new IDs are
      assigned to the node as it starts after a crash; instead, it marks all
      of its dirty replicas as needs-readd (so that they are re-added with a
      new ReplicaID, avoiding the preemptive snapshot as before).
    • go even further and instead of going to needs-readd right away, try to
      recover any promises made from the remainder of the nodes for some time,
      i.e. participate in the Raft group passively, as a "black hole replica":
      • don't serve any client traffic.
      • receive and process MsgApp, but don't acknowledge anything (Raft would
        have to learn that this is a thing. Perhaps similar to non-voting
        replicas?).
      • propose a dummy command to the Replica. If we see it committed in the
        incoming Raft log stream (via MsgApp), we know that we have all the
        log entries we previously acknowledged, and so that promise has been
        recovered.
      • collect the maximum term across all (not only a quorum!) peers. Make sure
        to never vote unless the requested term is higher than the maximum seen.
        This recovers any previous vote cast.
      • Try analogous strategies for any other promises made until all promises
        are recovered or we give up (in particular, give up if there's a config
        change).
  • Assuming the resulting amount of meta range writes is still too high, we could
    try to add enough indirection to store the ReplicaID directly on the
    corresponding range (or hashed across the cluster). For example, roughly
    changing the stored ReplicaDescriptor to

      type ReplicaDescriptor struct {
          PersistentNodeIdentifier  int64 // survives the permanent-failure crash
          PersistentStoreIdentifier int64 // ditto
      }

    and introducing a second translation table which maps these identifiers to
    their incarnation's NodeID and StoreIDs and, most importantly, previous
    ReplicaIDs. The optimization would be that the ReplicaID could be located
    on the respective range it applies to, and not in the central meta tables
    which would otherwise become a hotspot for potentially millions of
    configuration changes. There are many problems with this approach, but it
    could be possible. For now it's just a quarter-baked idea, though.

  • Running with much larger ranges. This is interesting for a variety of
    reasons, the relevant one here being that the number of required configuration
    changes scales like cluster_data_size / range_size.

  • It could also be interesting to make the decision to sync/not sync not a
    global one, but configurable on, say, a per-zone basis.

Labels: A-kv-client (Relating to the KV client and the KV interface), C-wishlist (A wishlist feature), X-stale