-
Notifications
You must be signed in to change notification settings - Fork 4.1k
storage: excessively large raft logs #27772
Description
We've seen a handful of occurrences of excessively large Raft logs in clusters in the wild without being able to identify the root cause. A large Raft log should normally not occur as the proposal quota mechanism should limit the size of the Raft log if a range is healthy. If one node in the range is down the proposal quota mechanism will drop that replica from consideration, but at the same time allow Raft log truncation to occur, so the size of the Raft log should be limited to whatever Raft log truncation dictates (default 4MiB).
While each follower in a range maintains a Raft log, control of what gets written to the Raft log is in the hands of the leader. No leader and there are no Raft log writes. What happens if there is a leader but the Raft log entries it proposes are never getting applied? This can happen when the range is below quorum. Prior to 2.0, the Raft CheckQuorum mechanism would kick in and the leader would quickly step down. In 2.0 we turned on PreVote and disabled CheckQuorum (@bdarnell are they incompatible?). With CheckQuorum disabled and PreVote enabled a leader can remain the leader forever when the range is below quorum. The PreVote mechanism prevents another range from calling an election and without CheckQuorum enabled the leader won't step down.
Note that the proposal quota mechanism still applies in this scenario and incoming operations will eventually block waiting for quota. There is another mechanism at work. The periodic Raft proposal refresh (which is necessary to deal with dropped proposals) interacts badly with this scenario. In particular, we refresh pending proposals every leader election timeout period (3s?). Refreshing a proposal only results in reproposing if the lease index is still compatible, but if a range is below quorum no other commands will be applied so it seems like we would see reproposals.
The above is a theory based on a reading of the code. It matches the conditions in the clusters that have experienced problems with large Raft logs (i.e. ranges that have gone through long periods of unavailability). @nvanbenschoten is going to work on writing a test to reproduce which should be straightforward to do if the above is correct.
Fixing this problem shouldn't be too difficult. @bdarnell says:
we want to stop refreshRaftProposals(reasonTicks) if liveness indicates that we don't have a quorum of live followers
and
i think we might also want to only do reasonTicks refreshes if we are a follower
the reason that path exists is due to raft's unacknowledged proposal forwarding
if we handled ErrProposalDropped (#21849) and did our own forwarding, i think we might be able to get rid of reasonTicks refreshes completely