If a node holding a primary shard leaves the cluster, one of the replica shards is immediately promoted to primary to replace the failed copy. Today, if a snapshot is in progress when the promotion happens, the corresponding shard-level snapshot fails and the overall snapshot status is at best PARTIAL. This is a problem for graceful shutdowns (#70338), which ideally would not result in any such failures. In cases where a replica is promoted to replace a failed primary, it would be better to retry the shard-level snapshot on the new primary instead.
This isn't the first time this idea has come up:
`// TODO: Restart snapshot on another node?`
(from elasticsearch/server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java, line 1091 at commit 9addf0b)
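The proposed behavior can be sketched abstractly. Note that every class and method name below is illustrative, not Elasticsearch's actual internals: the sketch only models the policy that when the node running a shard snapshot leaves, the coordinator reassigns that shard snapshot to the newly promoted primary rather than failing it outright.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; none of these types exist in Elasticsearch.
enum ShardSnapshotState { IN_PROGRESS, FAILED, RETRYING }

class ShardSnapshot {
    final String shardId;
    String assignedNode;
    ShardSnapshotState state = ShardSnapshotState.IN_PROGRESS;
    ShardSnapshot(String shardId, String node) {
        this.shardId = shardId;
        this.assignedNode = node;
    }
}

class SnapshotCoordinator {
    private final Map<String, ShardSnapshot> shards = new HashMap<>();
    // shardId -> current primary node, updated as replicas are promoted
    private final Map<String, String> currentPrimary = new HashMap<>();

    void startShardSnapshot(String shardId, String primaryNode) {
        currentPrimary.put(shardId, primaryNode);
        shards.put(shardId, new ShardSnapshot(shardId, primaryNode));
    }

    // Called when a replica is promoted after the old primary's node left.
    void onPrimaryPromoted(String shardId, String newPrimaryNode) {
        currentPrimary.put(shardId, newPrimaryNode);
    }

    // Called when a node leaves the cluster.
    void onNodeLeft(String node) {
        for (ShardSnapshot s : shards.values()) {
            if (s.state == ShardSnapshotState.IN_PROGRESS && node.equals(s.assignedNode)) {
                String newPrimary = currentPrimary.get(s.shardId);
                if (newPrimary != null && !newPrimary.equals(node)) {
                    // Proposed behavior: restart the shard snapshot on the
                    // newly promoted primary instead of failing it.
                    s.assignedNode = newPrimary;
                    s.state = ShardSnapshotState.RETRYING;
                } else {
                    // No promoted replica available: the shard snapshot fails
                    // and the overall snapshot is at best PARTIAL (today's
                    // behavior for every such shard).
                    s.state = ShardSnapshotState.FAILED;
                }
            }
        }
    }

    ShardSnapshotState stateOf(String shardId) { return shards.get(shardId).state; }
    String nodeOf(String shardId) { return shards.get(shardId).assignedNode; }
}
```

The key design point is that the coordinator consults the current routing (which replica was promoted) at the moment the node departure is observed, so a shard snapshot only fails when there is genuinely no surviving copy to retry on.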