kvserver: delay log truncations in presence of paused followers #84467
Description
Is your feature request related to a problem? Please describe.
PR #83851 introduces pausing of replication to followers on overloaded stores. This frequently leads to these replicas requiring snapshots. It could be desirable to make log truncations more lenient in the presence of overloaded followers, to delay the snapshot until it is clear that it is the cheaper way to catch up the replica.
Also, in regimes where a follower is paused and unpaused periodically for extended periods of time (which is assumed to be a normal case under persistent overload), it is doubly important to avoid repeated snapshots.
Describe the solution you'd like
Set a more lenient MaxLogSize in the raft log queue when overloaded (or paused?) followers are present.
The MaxLogSize set here should be a function of the configured max range size, most likely a fraction of it (since applying a snapshot is vastly more efficient than iterating over the log in the common case).
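A minimal sketch of that idea, for illustration only. The function name, the 16 MiB default, and the 1/4 fraction are all assumptions made up for this example; they are not taken from the CockroachDB code base, and the right fraction would need to be determined experimentally.

```go
package main

// NOTE: every identifier and constant below is hypothetical and only
// illustrates the proposal; none of it is CockroachDB's actual API.

// defaultMaxLogSize is a placeholder for the regular truncation threshold
// used when all followers are healthy (assumed value).
const defaultMaxLogSize int64 = 16 << 20 // 16 MiB

// lenientDivisor expresses the lenient threshold as a fraction (1/4) of the
// configured max range size. The fraction is an assumption.
const lenientDivisor int64 = 4

// effectiveMaxLogSize returns the log size above which the raft log queue
// would truncate. With no paused followers it returns the default; otherwise
// it returns a fraction of the max range size, never less than the default.
func effectiveMaxLogSize(maxRangeBytes int64, pausedFollowers int) int64 {
	if pausedFollowers == 0 {
		return defaultMaxLogSize
	}
	lenient := maxRangeBytes / lenientDivisor
	if lenient < defaultMaxLogSize {
		// Being lenient should never make truncation more aggressive.
		return defaultMaxLogSize
	}
	return lenient
}
```

Under these assumptions, a 512 MiB range with one paused follower would tolerate up to 128 MiB of log before truncation, while the same range with no paused followers keeps the default threshold.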
But perhaps the real change necessary is to set a higher bar for when to even attempt to send a snapshot, i.e. block snapshots not only when the follower is paused, but also when it is still "fairly overloaded" (say score >0.4 where the pausing cutoff is 0.8). This is hard to reason through, though, since snapshots might be required to restore quorum, and such snapshots should not be delayed.
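The gating rule could look roughly like the sketch below. The function and parameter names are hypothetical, and the choice to let quorum-restoring snapshots bypass the check entirely is an assumption; how to reliably detect that a snapshot is needed for quorum is exactly the hard part and is not addressed here.

```go
package main

// NOTE: hypothetical names; this is a sketch of the proposed rule, not
// existing CockroachDB code.

// snapshotScoreCutoff is the example "fairly overloaded" score from the
// proposal. The pausing cutoff mentioned alongside it is 0.8.
const snapshotScoreCutoff = 0.4

// shouldSendSnapshot reports whether a snapshot to a follower should be
// attempted. Snapshots needed to restore quorum are never delayed
// (assumption: the caller can determine neededForQuorum). Otherwise the
// snapshot is blocked while the follower is paused or its overload score
// is still above the cutoff.
func shouldSendSnapshot(score float64, paused, neededForQuorum bool) bool {
	if neededForQuorum {
		return true
	}
	if paused {
		return false
	}
	return score <= snapshotScoreCutoff
}
```

With this rule, a follower that was just unpaused at a score of 0.6 would not yet receive a snapshot, which gives the lenient log truncation a chance to let it catch up from the log instead.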
Describe alternatives you've considered
Additional context
Jira issue: CRDB-17676