Prevent CCR followers from falling fatally far behind #38718
Description
A leader shard currently provides no guarantee that it will retain the history a following shard needs in order to replicate operations from the leader. A follower can fall behind the retained history under normal circumstances as well as in error scenarios (network connections are broken, the follower goes offline, etc.). In either case this is fatal for the follower, and the only recourse is file-based recovery, which can be expensive. It also means that during this recovery period the follower is offline, which defeats its purpose as an extra available copy of the leader shard outside the leader's cluster.
The underlying cause here is the nature of soft deletes. Soft-deleted operations are retained only up to a retention limit; beyond that limit they can be merged away. Currently the retention limit is a fixed number of operations, which is too difficult for users to reason about (e.g., if I want my leader history to be available in case the follower is offline for up to twelve hours, how many operations should I retain?). Instead, we should use shard history retention leases: the follower takes a lease on the history of operations on the leader shard, and the leader guarantees that history is retained up to some time limit.
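To make the proposal concrete, here is a minimal sketch of how a time-bounded lease could determine what history the leader must keep. It is an illustrative model only, not Elasticsearch code: `RetentionLease`, `LeaderHistory`, `lease_ttl_seconds`, and the method names are all hypothetical, and the rule "an expired lease no longer pins history" is an assumption about how the time limit would apply.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RetentionLease:
    """Hypothetical lease held by one follower on the leader's history."""
    lease_id: str
    retaining_seq_no: int  # lowest sequence number the holder still needs
    renewed_at: float      # time of last renewal, in seconds


class LeaderHistory:
    """Toy model of a leader shard that retains history for live leases."""

    def __init__(self, lease_ttl_seconds: float) -> None:
        self.lease_ttl_seconds = lease_ttl_seconds
        self.leases: Dict[str, RetentionLease] = {}
        self.max_seq_no = -1  # no operations indexed yet

    def index(self, count: int = 1) -> None:
        # Each indexed operation is assigned the next sequence number.
        self.max_seq_no += count

    def renew(self, lease_id: str, retaining_seq_no: int, now: float) -> None:
        # The follower reports how far it has replicated and resets its timer.
        self.leases[lease_id] = RetentionLease(lease_id, retaining_seq_no, now)

    def min_retained_seq_no(self, now: float) -> int:
        # Operations at or above this sequence number must not be merged away.
        # Assumption: a lease that was not renewed within the TTL is ignored.
        live = [
            lease.retaining_seq_no
            for lease in self.leases.values()
            if now - lease.renewed_at <= self.lease_ttl_seconds
        ]
        # With no live lease, nothing is pinned by followers.
        return min(live, default=self.max_seq_no + 1)


twelve_hours = 12 * 3600
history = LeaderHistory(lease_ttl_seconds=twelve_hours)
history.index(1000)
history.renew("follower-1", retaining_seq_no=500, now=0)
history.index(1_000_000)  # heavy indexing while the follower is offline

six_hours_later = history.min_retained_seq_no(now=6 * 3600)
thirteen_hours_later = history.min_retained_seq_no(now=13 * 3600)
```

In this model the user reasons only about the time limit (twelve hours), regardless of how many operations the leader indexes in the meantime. A fixed operation count would instead require estimating the indexing rate in advance.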
Relates #37165