storage: excessively large raft logs #27772

We've seen a handful of occurrences of excessively large Raft logs in clusters in the wild without being able to identify the root cause. A large Raft log should not normally occur: if a range is healthy, the proposal quota mechanism should limit the size of the Raft log. If one node in the range is down, the proposal quota mechanism will drop that replica from consideration but still allow Raft log truncation to occur, so the size of the Raft log should be limited to whatever Raft log truncation dictates (default 4MiB).

While each follower in a range maintains a Raft log, control of what gets written to the Raft log is in the hands of the leader: without a leader, there are no Raft log writes. What happens if there is a leader but the Raft log entries it proposes never get applied? This can happen when the range is below quorum. Prior to 2.0, the Raft `CheckQuorum` mechanism would kick in and the leader would quickly step down. In 2.0 we turned on `PreVote` and disabled `CheckQuorum` (@bdarnell are they incompatible?). With `CheckQuorum` disabled and `PreVote` enabled, a leader can remain the leader forever when the range is below quorum: the `PreVote` mechanism prevents another replica from calling an election, and without `CheckQuorum` the leader won't step down on its own.
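
To make the step-down behavior concrete, here is a toy model of a leader's tick loop with and without a `CheckQuorum`-style check. It is only a sketch: `leaderState`, `tick`, and the tick counts are hypothetical names and values chosen for illustration, not the actual Raft library code.

```go
package main

import "fmt"

// leaderState is a hypothetical, heavily simplified model of a Raft leader.
type leaderState struct {
	checkQuorum   bool // models the CheckQuorum setting
	electionTicks int  // ticks per election timeout (assumed value)
	elapsed       int  // ticks since the last quorum check
	isLeader      bool
}

// tick advances the leader by one tick. activeFollowers is the number of
// followers heard from recently; total is the number of replicas in the range.
func (l *leaderState) tick(activeFollowers, total int) {
	if !l.isLeader {
		return
	}
	l.elapsed++
	if l.elapsed < l.electionTicks {
		return
	}
	l.elapsed = 0
	// The leader counts itself towards the quorum.
	if l.checkQuorum && activeFollowers+1 < total/2+1 {
		l.isLeader = false // step down: no quorum within an election timeout
	}
}

func main() {
	for _, cq := range []bool{true, false} {
		l := &leaderState{checkQuorum: cq, electionTicks: 10, isLeader: true}
		// Three replicas, both followers unreachable, 100 ticks.
		for i := 0; i < 100; i++ {
			l.tick(0, 3)
		}
		fmt.Printf("checkQuorum=%v stillLeader=%v\n", cq, l.isLeader)
	}
}
```

In this model the leader without the check keeps its leadership indefinitely, which is the precondition for the log growth discussed in this issue.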

Note that the proposal quota mechanism still applies in this scenario, so incoming operations will eventually block waiting for quota. But there is another mechanism at work: the periodic Raft proposal refresh (which is necessary to deal with dropped proposals) interacts badly with this scenario. In particular, we refresh pending proposals every leader election timeout period (3s?). Refreshing a proposal only results in reproposing if the lease index is still compatible, but if a range is below quorum no other commands will be applied, so it seems like every refresh would result in reproposals. Each reproposal appends another copy of the command to the leader's Raft log, so the log would keep growing for as long as the range stays below quorum.
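
A back-of-the-envelope simulation of the theory: a leader with a fixed set of pending proposals that reappends all of them on every refresh. The types, the `simulate` helper, the proposal count, and the proposal size are all hypothetical, and the 3s refresh interval is the unconfirmed figure from the paragraph above.

```go
package main

import "fmt"

// proposal is a hypothetical stand-in for a pending Raft command.
type proposal struct {
	id   int
	size int // encoded size in bytes
}

// leader models only the pending proposals and the uncommitted log tail.
type leader struct {
	pending  []proposal
	entries  int
	logBytes int
}

// appendEntry records one more copy of p in the log.
func (l *leader) appendEntry(p proposal) {
	l.entries++
	l.logBytes += p.size
}

// propose appends a new command and tracks it as pending.
func (l *leader) propose(p proposal) {
	l.pending = append(l.pending, p)
	l.appendEntry(p)
}

// refresh models the tick-driven refresh while below quorum: nothing has
// been applied, so every pending proposal is appended to the log again.
func (l *leader) refresh() {
	for _, p := range l.pending {
		l.appendEntry(p)
	}
}

// simulate returns the log's entry count and byte size after the given
// number of refreshes with no quorum.
func simulate(numProposals, size, refreshes int) (int, int) {
	l := &leader{}
	for i := 0; i < numProposals; i++ {
		l.propose(proposal{id: i, size: size})
	}
	for i := 0; i < refreshes; i++ {
		l.refresh()
	}
	return l.entries, l.logBytes
}

func main() {
	// 100 pending 1 KiB proposals; one hour at an assumed 3s interval
	// gives 1200 refreshes.
	entries, size := simulate(100, 1024, 1200)
	fmt.Printf("entries=%d bytes=%d\n", entries, size)
}
```

Under these assumptions the log grows linearly with the length of the outage, even though proposal quota caps the number of distinct pending proposals.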

The above is a theory based on a reading of the code. It matches the conditions in the clusters that have experienced problems with large Raft logs (i.e. ranges that have gone through long periods of unavailability). @nvanbenschoten is going to work on writing a test to reproduce this, which should be straightforward to do if the above is correct.

Fixing this problem shouldn't be too difficult. @bdarnell says:

> we want to stop refreshRaftProposals(reasonTicks) if liveness indicates that we don't have a quorum of live followers

and

> i think we might also want to only do reasonTicks refreshes if we are a follower
>
> the reason that path exists is due to raft's unacknowledged proposal forwarding
>
> if we handled ErrProposalDropped (https://github.com/cockroachdb/cockroach/issues/21849) and did our own forwarding, i think we might be able to get rid of reasonTicks refreshes completely
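
A minimal sketch of how the two suggestions quoted above could combine into a single guard. Only `reasonTicks` comes from the discussion; `refreshReason`, `reasonOther`, `hasLiveQuorum`, `shouldRefresh`, and the `followerOnly` switch are hypothetical names, and the live-replica count is assumed to come from node liveness.

```go
package main

import "fmt"

// refreshReason is a hypothetical enum of why a refresh was requested.
type refreshReason int

const (
	reasonOther refreshReason = iota // placeholder for all non-tick reasons
	reasonTicks
)

// hasLiveQuorum reports whether a strict majority of replicas is live.
func hasLiveQuorum(live, total int) bool {
	return live >= total/2+1
}

// shouldRefresh decides whether pending proposals should be refreshed.
// Non-tick refreshes are always allowed. Tick-driven refreshes are skipped
// when liveness shows no quorum and, if followerOnly is set, when this
// replica is the leader.
func shouldRefresh(reason refreshReason, isLeader, followerOnly bool, live, total int) bool {
	if reason != reasonTicks {
		return true
	}
	if !hasLiveQuorum(live, total) {
		return false
	}
	if followerOnly && isLeader {
		return false
	}
	return true
}

func main() {
	// Leader of a 3-replica range with only itself live: skip the refresh.
	fmt.Println(shouldRefresh(reasonTicks, true, false, 1, 3))
	// Same range with two live replicas: the refresh proceeds.
	fmt.Println(shouldRefresh(reasonTicks, true, false, 2, 3))
}
```

The `followerOnly` switch corresponds to the second, more tentative suggestion; leaving it off keeps only the liveness-based check.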
