Prevent CCR followers from falling fatally far behind #38718

@jasontedor

A leader shard currently provides no guarantee that it will retain the history a following shard needs to replicate operations from the leader. A follower can fall behind the retained history under normal circumstances as well as in error scenarios (broken network connections, the follower going offline, etc.). In either case, this is fatal for the follower, and the only recourse is a file-based recovery, which can be expensive. During that recovery the follower is offline, which defeats its purpose as an extra available copy of the leader shard outside the leader's cluster.

The underlying cause is the nature of soft deletes: they can be merged away once they fall outside a retention limit. Currently the retention limit is a fixed number of operations, which is too difficult for users to reason about (e.g., if I want my leader history to be available in case the follower is offline for up to twelve hours, how many operations should I retain?). Instead, we should use shard history retention leases: the follower takes a lease on the history of operations on the leader shard, and the leader guarantees that history up to some time limit.
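To make the proposal concrete, here is a rough sketch of how a time-bounded lease could determine what history the leader keeps. It is a toy model only: `RetentionLease`, `LeaderHistory`, `renew`, and `min_retained_seq_no` are hypothetical names invented for illustration and do not reflect Elasticsearch's actual classes or APIs.

```python
from dataclasses import dataclass


@dataclass
class RetentionLease:
    """Hypothetical lease: a follower asks the leader to keep operations >= retaining_seq_no."""
    lease_id: str
    retaining_seq_no: int
    renewed_at: float  # seconds, on whatever clock the caller supplies


class LeaderHistory:
    """Toy model of a leader shard's history retention (an assumption, not the real design)."""

    def __init__(self, lease_ttl_seconds):
        self.lease_ttl_seconds = lease_ttl_seconds
        self.leases = {}

    def renew(self, lease_id, retaining_seq_no, now):
        # A follower renews its lease as it makes progress, advancing the
        # sequence number it still needs and refreshing the timestamp.
        self.leases[lease_id] = RetentionLease(lease_id, retaining_seq_no, now)

    def min_retained_seq_no(self, now, max_seq_no):
        # Only leases renewed within the time limit constrain merges.
        live = [
            lease for lease in self.leases.values()
            if now - lease.renewed_at <= self.lease_ttl_seconds
        ]
        if not live:
            # No live lease: nothing at or below max_seq_no must be kept.
            return max_seq_no + 1
        return min(lease.retaining_seq_no for lease in live)


history = LeaderHistory(lease_ttl_seconds=12 * 3600)
history.renew("follower-1", retaining_seq_no=100, now=0)

# Six hours later the lease is still live, so operations from 100 onward are kept.
within_limit = history.min_retained_seq_no(now=6 * 3600, max_seq_no=5000)

# Thirteen hours later the lease has expired and no longer holds back merges.
after_expiry = history.min_retained_seq_no(now=13 * 3600, max_seq_no=5000)
```

In this sketch the user reasons only about a duration (twelve hours here) and never about an operation count, which is the usability point made above.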

Relates #37165
