
storage: excessively large raft logs #27772


Description

@petermattis

We've seen a handful of occurrences of excessively large Raft logs in clusters in the wild without being able to identify the root cause. A large Raft log should not normally occur, as the proposal quota mechanism should limit its size while a range is healthy. If one node in the range is down, the proposal quota mechanism drops that replica from consideration but still allows Raft log truncation to occur, so the size of the Raft log should be limited to whatever Raft log truncation dictates (default 4MiB).

While each follower in a range maintains a Raft log, control of what gets written to the Raft log is in the hands of the leader: no leader, no Raft log writes. What happens if there is a leader but the Raft log entries it proposes never get applied? This can happen when the range is below quorum. Prior to 2.0, the Raft CheckQuorum mechanism would kick in and the leader would quickly step down. In 2.0 we turned on PreVote and disabled CheckQuorum (@bdarnell are they incompatible?). With CheckQuorum disabled and PreVote enabled, a leader can remain the leader forever while the range is below quorum: PreVote prevents another replica from calling an election, and without CheckQuorum the leader won't step down.

Note that the proposal quota mechanism still applies in this scenario, and incoming operations will eventually block waiting for quota. But another mechanism interacts badly with this scenario: the periodic Raft proposal refresh (which is necessary to deal with dropped proposals). In particular, we refresh pending proposals every leader election timeout period (3s?). Refreshing a proposal only results in a reproposal if the lease index is still compatible, but if a range is below quorum no other commands will be applied, so the lease index stays compatible and each refresh reproposes the command, appending another copy to the leader's Raft log.

The above is a theory based on a reading of the code. It matches the conditions in the clusters that have experienced problems with large Raft logs (i.e. ranges that have gone through long periods of unavailability). @nvanbenschoten is going to work on writing a test to reproduce this, which should be straightforward if the above is correct.

Fixing this problem shouldn't be too difficult. @bdarnell says:

we want to stop refreshRaftProposals(reasonTicks) if liveness indicates that we don't have a quorum of live followers

and

i think we might also want to only do reasonTicks refreshes if we are a follower

the reason that path exists is due to raft's unacknowledged proposal forwarding

if we handled ErrProposalDropped (#21849) and did our own forwarding, i think we might be able to get rid of reasonTicks refreshes completely


Labels

A-kv-client: Relating to the KV client and the KV interface.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
S-1-stability: Severe stability issues that can be fixed by upgrading, but usually don't resolve by restarting.
