
replication: avoid fsync during raft log append #88442

@tbg

Description


Traditional quorum replication requires all log entries to be durably stored on
stable storage on a quorum of replicas before being considered committed.

In practice, for us this means fdatasync'ing raft log appends. The main downside
of this requirement is that fsync is latency-intensive: CockroachDB runs much
faster on write-heavy workloads with fsync turned off or reduced. In fact, some
customers are known to de facto do this, by running on ZFS with the fsync
syscall mocked out as a noop.[^1]

Since CockroachDB ranges are usually deployed across AZs or even regions, where
correlated power failures are likely rare, it stands to reason that trading
durability for performance could be beneficial.

As explained in [^2], one can run CockroachDB "correctly" with fsync turned off
if one ensures that a node that crashes (i.e. exits in a way that may lose log
writes) does not return to the cluster (i.e. has to be wiped and re-join as new
node). This is equivalent to running with fsync turned on (though more
performant) and pretending that any crash failure is permanent.

This is unappealing due to the need to replicate a lot of data, almost all of
it redundantly. The missing piece is a mechanism that allows a power-cycled
node to return to the cluster gracefully.

To give an explicit example of why naively letting the node rejoin when it
didn't properly obey durability leads to incorrect behavior, consider three
nodes n1, n2, and n3 which form the members of some range r1.

An entry at index 100 is appended to n1 and n2 and reaches quorum. It is
committed and applied by n1. n2 power-cycles and loses index 100 (which it
previously acked). Then we are in the following state:

```
                 committed
                     |
                     v
log(n1) = [..., 99, 100]
log(n2) = [..., 99]
log(n3) = [..., 99]
```

which makes it possible for n2 and n3 to jointly elect either of them as the leader, and to subsequently overwrite the committed (and, on n1, likely applied) entry at index 100.

To avoid this, we need to ensure that n2 doesn't vote or campaign until it is
guaranteed to have been caught up across all entries that it may, in a previous
life, have acked. A simple way to do this is to propose (i.e. ask the leader to
propose) an entry carrying a UUID, and to abstain from voting until n2 has this
entry in its log. In effect, the follower "is semantically down" until it has
been caught up past everything it previously acked, but it can be brought "up"
again with the minimal amount of work possible.

More work would be necessary to actually do this.

For one, etcd/raft is heavy on assertions tracking whether durable state has
regressed. For example, upon restarting, n2 might be contacted by n1 with an
MsgApp appending an entry at index 101, which n1 considers possible based on
what it believes n2's durable state to be (it thinks n2 has index 100). Upon
receiving this message, n2 will exit with a fatal error. We would need to make
raft more tolerant of loss of durability.

We may also need to be careful about command application, especially for
specialized commands such as AddSSTs, log truncation, splits, etc., though I'm
not sure there are any new complications. To be safe, we could always use a
separate raft command encoding for "non-vanilla" entries and make sure to fsync
whenever one of these enters the log. However, the semantics around
configuration changes are already very complex.[^3]

A somewhat related issue is https://github.com/etcd-io/etcd/issues/12257[^4].

There are alternatives to doing this outside of raft. For example, if we had bulk replication changes, we could "re-add" the node in place, but under a new ReplicaID, and with a way to re-use the existing snapshot (i.e. apply it in place). This would have the same effect, but avoid any complications at the raft layer (since we would not be violating durability).

Bulk replication changes are tricky, though, since currently any replication change has to update the range descriptor and in particular the meta2 copy. One relevant observation is that currently, the meta2 copy is identical to the range's copy, but it doesn't have to be. The meta2 copy only needs enough information to allow CPuts as part of replication changes and to route requests; maybe there is a way to not have it include the ReplicaID, in which case we could bump the ReplicaID and run the replication change in a 1PC txn on the range itself.

Jira issue: CRDB-19825

Footnotes

[^1]: Though the main driver there is resilience to EBS-level slowdowns on certain write-heavy workloads on underprovisioned gp3 volumes. I'm hesitant to think of disabling raft fsync as the right solution, since we are lease-based, and so a disk with slow writes will still have severe problems serving user traffic.

[^2]: https://github.com/cockroachdb/cockroach/issues/19784

[^3]: https://github.com/etcd-io/etcd/issues/11284, https://github.com/etcd-io/etcd/issues/7625, https://github.com/etcd-io/etcd/issues/12359

Labels: C-enhancement (solution expected to add code/behavior + preserve backward-compat), T-kv (KV Team)
