-
Notifications
You must be signed in to change notification settings - Fork 4.1k
storage: bump LeaseAppliedIndex instead of reproposing #27054
Description
In #26830, we observed an instance in which the uncommitted portion of the Raft log grew to over 7GB of size, leading to an out-of-memory error due to being loaded into memory on election. This particular out-of-memory issue has been fixed, but not the process by which the uncommitted Raft log grew to 7GB in the first place.
This likely happened because the split queue retried the split repeatedly, each time proposing on the order of 100mb into Raft. This in itself is a problem (that has partially been addressed, see #25233) but a problematic piece of code related to this is the reproposal logic:
https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/replica.go#L4396-L4417
which repeatedly creates copies of proposals that don't commit within a target interval. Consider the case in which Raft commit is significantly delayed or even stalled (lost quorum) -- the mechanism is essentially blowing up the uncommitted tail of the Raft log for no good reason.
This raises the question why we're even trying to repropose these commands. We should simply propose a "no-op" through Raft that allocates a new LeaseAppliedIndex, after which the pending proposals below it are discarded (to be retried externally).
The semantics of that approach seem saner as we never duplicate large commands and as proposers already have to expect retries and are equipped to handle them.
@bdarnell this seems worth doing for 2.1, if you don't have any reservations. The change shouldn't be too big (probably most of the the time spent fixing up the tests that test this through intimate knowledge of the current code).