Proposal: support fully controlling fsync frequency in raft #12257
Description
Support fully controlling fsync frequency in raft
Summary
Let the application decide when to sync logs, and make the leader aware of the safe index
to commit.
Motivation
The current way of driving raft is a loop of the following steps:
1. get ready from raft
2. persist ready
3. advance ready in raft
Only one ready may be processed at a time, and the leader expects all pending logs in a ready to be synced to disk before handling the next one.
The mechanism is simple but has several problems:
- It can cause high fsync frequency. In practice with TiKV, we observed that the fsync frequency can easily reach the hardware's limit, which causes unstable latency and hurts performance. High fsync frequency is also expensive in the cloud.
- The sync size is unpredictable, which can waste IO when syncing small amounts of data.
So we need a way to control the fsync frequency that preserves correctness and achieves maximum performance.
Detailed design
Since all nodes communicate via messages, the application can stash all outgoing messages and send them out only after the corresponding logs are synced. This way, the application is free to fetch and advance as many readies as it wants and sync whenever it wants.
This works well for followers. For leaders, however, both etcd and TiKV follow raft thesis 10.2.1, which sends out the leader's messages before syncing so that the leader and followers write in parallel. If the leader batches up two or more readies, it can accidentally commit un-synced logs. For example, if the leader has logs 3 and 4 in the first ready, then after advancing and receiving followers' ACKs it can broadcast the commit to all followers in the second ready, because the leader assumes its logs were synced in the first ready. To fix this, we need to stop updating the leader's own progress until its logs are synced to disk, so that when the leader calculates the commit index, only truly synced progress is counted.
It's possible for a quorum of followers to sync logs before the leader. In that case the leader is still safe to consider the logs committed, but it must not mark them ready to apply until its own logs are synced; otherwise the state shared between raft and the application can be corrupted.
Note that batching changes to the hard state doesn't affect correctness. Changes to term and vote can only happen when the node is a follower, so messages will not be sent out before those changes are synced, the same as before. Changes to commit can happen when the node is the leader; however, the commit index is not required to be stored at all, according to raft thesis 3.8.
To conclude, we need to make the following changes to the raft library:
- Don't update the leader's own progress when logs are merely appended;
- Add a method that lets the application inform raft of the synced index, which should update the leader's progress and trigger log commitment.
Everything else is left to applications, which should choose different processing logic for different roles.
Drawbacks
It adds some complexity for applications to do the batching.
Alternatives
Pause handling raft ready until the application is ready to sync data. This approach doesn't require any changes to the raft library, but it causes bursty writes to disk. Our benchmarks show that this approach doesn't perform as well as the proposed one.