storage: thought experiment: productionize kv.raft_log.synchronize==false #19784
CockroachDB + !kv.raft_log.synchronize = ?
I had to think about what would happen if we didn't sync to disk in Raft the
other day, and wanted to write down my thoughts to see if there's anything
obviously wrong with them (which I usually expect there is). cc @bdarnell
Background
Performing synchronous disk writes in the critical path is expensive. It is easy
to understand why it is required in a monolithic datastore: once a write is
acknowledged, it must not vanish (the D in ACID).
We're a distributed system that replicates entries to a majority of nodes before
acknowledging them to the client, and so syncing to disk is less obviously
non-negotiable.
The enemy in this scenario is that by, say, power-cycling a subset of nodes, we
violate promises made, the most prominent of which are Raft-level promises such
as "I have this entry in my log" and "I won't forget that I already voted in
this term". Naively, not syncing would easily allow a node to break those after
returning from a crash.
But nodes could spontaneously combust at any time, too, and we can recover from
a certain number of overlapping fires (as long as quorums remain live). So what
if we treated any scenario in which we may have lost unsynched writes to disk as
a permanent node failure?
Suggested changes (unoptimized)
Consider making the following changes:
- When a node crashes, it has to re-join the cluster as a new (empty) node
  afterwards. Concretely:
  - early in the start sequence, check for a `DirtyShutdown` marker. If found,
    wipe the data dir and restart as a fresh node.
  - if no marker is found, write the `DirtyShutdown` marker and continue booting.
  - on clean shutdown, as the very last action, sync and remove the marker.
- Run without ever having to wait for a disk sync.
This mechanism keeps all the promises: if something is "sent to" disk (whether
it makes it there or not), then either it is still there when the node comes
back (so the promises are kept), or the node as such never comes back (the
machine returns, but under a new `NodeID` and not burdened by any of its prior
promises).
Drawbacks
If there weren't any drawbacks, we'd obviously be doing this already. The reason
we don't is that this makes a lot of failures that weren't previously fatal
catastrophic.
There are principally two types of failures, namely temporary and permanent
ones. A temporary one is the kind in which a node goes down, but keeps its
state: it can come back and participate again. A permanent one implies the loss
of previously acknowledged information, i.e. losing the whole hard drive
containing the log, forgetting a few entries in the log, or forgetting who a
vote was cast for.
In other words, a temporary failure could be modeled as a network outage or just
a long CPU freeze. A permanent failure is the irreversible loss of a replica.
Running without syncing before making promises implies that every unclean
shutdown has to be treated as a permanent failure.
Permanent failures are expensive. A new replica has to be added to the group; a
snapshot has to be transferred; a config change has to be carried out and new
meta entries are written. This takes time, and the cluster is vulnerable while
it is being repaired.
As the name suggests, permanent failures are also more dangerous. A temporary
outage of the whole cluster may be annoying, but no data is lost. On the other
hand, a permanent outage of a quorum of nodes (say two out of three) is
catastrophic, requires manual recovery, and implies having to live with data
loss.
There's little we can do about the latter except provide appropriate damage
control (for example, syncing at regular intervals), so the bottom line is
that if these failures are expected to be handled gracefully, not syncing is
out of the question.
But assuming that risk is to be taken (say the cluster is in three availability
zones with decent UPS coverage and adequately managed enterprise-grade SSDs),
we can at least shorten the recovery window:
Optimizations
- Instead of wiping the data directory and starting as a fresh node, the data
  is kept but all replicas are transformed into preemptive snapshots (i.e.
  their `ReplicaID` is wiped). The node still gets a new `NodeID` (and
  `StoreID`s) but allows the other members of its Ranges' Raft groups to
  associate these new IDs with the old ones, allowing them to perform
  configuration changes that add Replicas on the previous stores (under their
  new IDs).
- This avoids having to move any data in the common case, but still incurs the
  cost of carrying out a lot of configuration changes.
- To address this, we could
  - track which replicas are "dirty" and which ones aren't. A replica is
    "dirty" if it has potentially made a promise that hasn't been synced. A
    replica becoming "dirty" has to sync once (to persist that fact) and,
    assuming it hasn't made promises in a sufficiently long time (on the order
    of seconds), wipes its dirty flag (which does not have to sync). This
    essentially means that the dirty replicas are those that see
    regular-enough write traffic. We drop the requirement that new IDs are
    assigned to the node as it starts after a crash, and instead it marks all
    of its dirty replicas as `needs-readd` (so that they are re-added with a
    new `ReplicaID`, avoiding the preemptive snapshot as before).
  - go even further and, instead of going to `needs-readd` right away, try to
    recover any promises made from the remainder of the nodes for some time,
    i.e. participate in the Raft group passively, as a "black hole replica":
    - don't serve any client traffic.
    - receive and process `MsgApp`, but don't acknowledge anything (Raft would
      have to learn that this is a thing. Perhaps similar to non-voting
      replicas?).
    - propose a dummy command to the Replica. If we see it committed in the
      incoming Raft log (via the `MsgApp` stream), we know that we have all
      the log entries we previously acknowledged, and so that promise has been
      recovered.
    - collect the maximum term across all (not only a quorum of!) peers. Make
      sure to never vote unless the requested term is higher than the maximum
      seen. This recovers any previous vote cast.
    - try corresponding strategies for any other promises made, until all
      promises are recovered or we give up (in particular, give up if there's
      a config change).
- Assuming the resulting amount of meta range writes is still too high, we
  could try to add enough indirection to store the `ReplicaID` directly on the
  corresponding range (or hashed across the cluster). For example, roughly
  changing the stored `ReplicaDescriptor` to

  ```go
  type ReplicaDescriptor struct {
      PersistentNodeIdentifier  int64 // survives the permanent-failure crash
      PersistentStoreIdentifier int64 // ditto
  }
  ```

  and introducing a second translation table which maps these identifiers to
  their incarnation's `NodeID` and `StoreID`s and, most importantly, previous
  `ReplicaID`s. The optimization would be that the `ReplicaID` could be located
  on the respective range it applies to, and not in the central meta tables,
  which would otherwise become a hotspot for potentially millions of
  configuration changes. There are many problems with this approach, but it
  could be possible. For now it's just a quarter-baked idea, though.
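To make the indirection concrete, here is a toy version of that second translation table. All names are invented for illustration; the point is that an unclean restart rewrites one row per store, rather than one `ReplicaDescriptor` per range in the meta tables:

```go
package main

import "fmt"

// Incarnation is a hypothetical row of the translation table: the current
// identity behind a persistent node identifier.
type Incarnation struct {
	NodeID    int32
	StoreID   int32
	ReplicaID int32 // previous ReplicaID, the piece kept off the meta ranges
}

func main() {
	// persistent identifier -> current incarnation
	table := map[int64]Incarnation{
		7: {NodeID: 1, StoreID: 1, ReplicaID: 12},
	}

	// After an unclean restart, the machine keeps its persistent identifier
	// but re-registers under a fresh NodeID/StoreID; only this row changes,
	// while every ReplicaDescriptor pointing at identifier 7 stays untouched.
	table[7] = Incarnation{NodeID: 4, StoreID: 5, ReplicaID: 13}
	fmt.Println(table[7].NodeID, table[7].ReplicaID) // 4 13
}
```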
- Running with much larger ranges. This is interesting for a variety of
  reasons, the relevant one here being that the number of required
  configuration changes scales like `cluster_data_size / range_size`.
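To put rough numbers on that scaling (the cluster and range sizes here are made up for illustration, not taken from the issue):

```go
package main

import "fmt"

func main() {
	// A hypothetical 10 TiB cluster: every range that held dirty replicas
	// needs roughly one configuration change after an unclean restart.
	const clusterData = int64(10) << 40 // 10 TiB
	for _, rangeSize := range []int64{64 << 20, 1 << 30} {
		fmt.Printf("range_size=%d MiB -> ~%d config changes\n",
			rangeSize>>20, clusterData/rangeSize)
	}
	// range_size=64 MiB   -> ~163840 config changes
	// range_size=1024 MiB -> ~10240 config changes
}
```

Going from 64 MiB to 1 GiB ranges cuts the recovery work by 16x, which is why larger ranges matter here.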
- It could also be interesting to make the decision to sync/not sync not a
  global one, but configurable on, say, a per-zone basis.