Do not allow stale replicas to automatically be promoted to primary #14671
Description
Consider a primary shard P hosted on node p and its replica shard Q hosted on node q. If p is isolated from the cluster (e.g., through node failure, a flapping NIC, or an excessively long garbage collection pause), indexing operations can continue on q after Q is promoted to primary; these indexing operations will be acknowledged to the requesting clients. If q is subsequently isolated before p rejoins and before a new replica is assigned to another node in the cluster, the subsequent rejoining of p can currently lead to P being promoted to primary again. The indexing operations acknowledged by q will be lost.
A mechanism needs to be built to prevent the automatic promotion of a stale shard in such a scenario and instead only promote a non-stale shard to primary (if a non-stale shard is available). The only scenario in which a stale shard should be promoted to primary is through manual intervention by a system operator (e.g., in cases when q suffers a total hardware failure).
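To make the scenario and the proposed guard concrete, here is a minimal toy sketch in Python. It assumes the master tracks, per shard, the set of copies that have received every acknowledged write. The names `ShardCopy`, `ShardGroup`, `in_sync`, and `force_stale` are hypothetical and chosen for illustration only; this is not Elasticsearch code and not necessarily how the fix would be implemented.

```python
from dataclasses import dataclass, field


@dataclass
class ShardCopy:
    """One copy of a shard on a node (hypothetical model)."""
    node: str
    docs: list = field(default_factory=list)
    reachable: bool = True


class ShardGroup:
    """Toy view of one shard's copies as tracked by the master (hypothetical)."""

    def __init__(self, primary, replicas):
        self.copies = {c.node: c for c in [primary, *replicas]}
        self.primary = primary.node
        # Nodes whose copy holds every acknowledged operation.
        self.in_sync = set(self.copies)

    def isolate(self, node):
        """Mark a node as unreachable; if it held the primary, the primary is lost."""
        self.copies[node].reachable = False
        if self.primary == node:
            self.primary = None

    def rejoin(self, node):
        """Mark a node as reachable again; its copy may be stale."""
        self.copies[node].reachable = True

    def index(self, doc):
        """Write to all reachable in-sync copies, then acknowledge."""
        if self.primary is None:
            raise RuntimeError("no active primary")
        live = {n for n in self.in_sync if self.copies[n].reachable}
        for n in live:
            self.copies[n].docs.append(doc)
        # Any copy that missed this write is stale from now on.
        self.in_sync = live

    def promote(self, force_stale=None):
        """Choose a primary; a stale copy requires an explicit operator override."""
        if self.primary is not None:
            return self.primary
        if force_stale is not None:
            if not self.copies[force_stale].reachable:
                raise RuntimeError("cannot promote an unreachable copy")
            # Manual intervention: the operator accepts possible data loss.
            self.primary = force_stale
            self.in_sync = {force_stale}
            return self.primary
        candidates = sorted(n for n in self.in_sync if self.copies[n].reachable)
        self.primary = candidates[0] if candidates else None
        return self.primary


# Replay the scenario from the description.
g = ShardGroup(ShardCopy("p"), [ShardCopy("q")])
g.index("doc1")             # acknowledged on both p and q
g.isolate("p")              # p drops out of the cluster
g.promote()                 # Q on q becomes primary
g.index("doc2")             # acknowledged by q only; p is now stale
g.isolate("q")              # q drops out before p returns
g.rejoin("p")
auto_result = g.promote()   # None: p is stale, so nothing is promoted
```

With the in-sync check in place, the shard stays without a primary until q returns or an operator explicitly calls `g.promote(force_stale="p")`, which in this sketch discards `doc2` knowingly rather than silently.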
Relates #10933