kvserver: delay log truncations in presence of paused followers #84467

@tbg

Is your feature request related to a problem? Please describe.

PR #83851 introduces pausing of replication to followers on overloaded stores. This frequently leads to these replicas requiring snapshots. It could be desirable to make log truncations more lenient in the presence of overloaded followers, to delay the snapshot until it is clear that it is the cheaper way to catch up the replica.

Also, in regimes where a follower is paused and unpaused periodically for extended periods of time (which is assumed to be a normal case under persistent overload), it is doubly important to avoid repeated snapshots.

Describe the solution you'd like

Set a more lenient MaxLogSize in the raft log queue when overloaded (or paused?) followers are present.
The MaxLogSize set here should be a function of the configured max range size, most likely a fraction of it (since applying a snapshot is vastly more efficient than iterating over the log in the common case).

But perhaps the real change necessary is to set a higher bar for when to even attempt to send a snapshot, i.e. block snapshots not only when the follower is paused, but also when it is still "fairly overloaded" (say score > 0.4, where the pausing cutoff is 0.8). This is hard to reason through, though, since snapshots might be required to restore quorum, and such snapshots should not be delayed.
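
The gating rule described above might look roughly like the following. This is a sketch under stated assumptions: `shouldSendSnapshot` and the `neededForQuorum` flag are hypothetical, and how the leader would actually determine that a snapshot is needed to restore quorum is exactly the open question raised above. The 0.4 and 0.8 values are the example numbers from the text.

```go
package main

import "fmt"

// Hypothetical thresholds taken from the example numbers in the issue text.
const (
	pauseCutoff    = 0.8 // followers at or above this score are paused
	snapshotCutoff = 0.4 // above this score a follower is "fairly overloaded"
)

// shouldSendSnapshot is a hypothetical helper, not a CockroachDB API. It
// blocks snapshots to followers whose overload score exceeds snapshotCutoff
// (which includes all paused followers, since pauseCutoff > snapshotCutoff),
// unless the snapshot is needed to restore quorum.
func shouldSendSnapshot(overloadScore float64, neededForQuorum bool) bool {
	if neededForQuorum {
		// Snapshots required to restore quorum are never delayed.
		return true
	}
	return overloadScore <= snapshotCutoff
}

func main() {
	fmt.Println(shouldSendSnapshot(0.9, false)) // paused follower
	fmt.Println(shouldSendSnapshot(0.5, false)) // unpaused but fairly overloaded
	fmt.Println(shouldSendSnapshot(0.2, false)) // healthy follower
	fmt.Println(shouldSendSnapshot(0.9, true))  // needed to restore quorum
}
```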

Describe alternatives you've considered

Additional context

Jira issue: CRDB-17676

Metadata

Assignees

No one assigned

    Labels

    A-kv-replication: Relating to Raft, consensus, and coordination.
    C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
    X-stale
    no-issue-activity
