Skip to content

Proposal: support fully control fsync frequency in raft #12257

@BusyJay

Description

@BusyJay

Support fully control fsync frequency in raft

Summary

Let application decides when to sync logs and make leader aware of the safe index
to commit.

Motivation

The current way of driving raft is a loop of following process

1. get ready from raft
2. persist ready
3. advance ready in raft

Only one ready is allowed to be processed at a time, and the leader expects all pending logs in the ready are synced to disk before handling next one.

The mechanism is simple but has several problems:

  1. it can cause high fsync frequency. In the practice of TiKV, we observed the fsync frequency can reach the limit of hardware very easily, which can cause unstable latency and hurt performance. Also high fsync frequency is also expensive in the cloud.
  2. sync size is unpredictable, which can waste IO when syncing small data size.

So we need a way to control the fsync frequency without hurting correctness and also get maximum performance.

Detailed design

Since all nodes are communicating with messages, so application can stash all messages and send them out only after logs are synced. In this way, application is free to fetch and advance as many readiness as they want and sync whenever they want.

It works great for followers. However, for leaders, both Etcd and TiKV have follow the raft thesis 10.2.1 that sends out messages of leader before syncing to make leader and follower write in parallel. If leader batches up more than two readiness, it can commit an un-synced logs accidentally. For example, if leader has logs 3, 4 in the first ready, after advancing and receiving followers' ACK, it can broadcast commit to all followers in the second ready. This is because leader assumes logs are synced in the first ready. To fix the problem, we need to stop updating leader's progress until logs are synced to disk. So when leader calculates commit index only those really sync progress are considered.

It's possible quorum followers sync logs before leader, in such case, leader is also safe to consider the logs are committed, but it should not mark them ready to apply until logs are synced, otherwise it can corrupt the state between raft and application.

Note that batching changes to hard state doesn't affect the correctness. Changes to term and vote can only happen when the node is follower. So messages will not be sent out before changes are synced, which is same as before. Changes to commit can happen when the node is leader, however, commit index is not required to be store at all according to raft thesis 3.8.

To conclude, we need to make those changes to raft library:

  1. Don't update leader's progress when logs are appended,
  2. Add a method to allow application inform raft about the synced index, which should change its progress and trigger logs commitment.

All others are left to applications to guarantee choose different processing logic for different roles.

Drawbacks

It's a little complicated for applications to do the batch.

Alternatives

Pause handling raft ready until application is ready to sync data. This approach doesn't need to make any changes to raft library, but it can cause write pulse to disk. Our benches show that the approach doesn't perform as well as the proposed one.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions