If a node holding a primary shard leaves the cluster, one of the replica shards is immediately promoted to primary to replace the failed copy. Today, if a snapshot is in progress when the promotion happens, the corresponding shard-level snapshot fails and the overall snapshot status is at best PARTIAL. This is a problem for graceful shutdowns (#70338), which ideally would not result in any such failures. In cases where a replica is promoted to replace a failed primary, it would be better to retry the shard-level snapshot on the new primary instead.
This isn't the first time this idea has come up:
`// TODO: Restart snapshot on another node?`
(from elasticsearch/server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java, line 1091 at commit 9addf0b)
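The proposed behavior can be sketched abstractly. Note that every class and method name below is illustrative, not Elasticsearch's actual internals: the sketch only models the policy that when the node running a shard snapshot leaves, the coordinator reassigns that shard snapshot to the newly promoted primary rather than failing it outright.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; none of these types exist in Elasticsearch.
enum ShardSnapshotState { IN_PROGRESS, FAILED, RETRYING }

class ShardSnapshot {
    final String shardId;
    String assignedNode;
    ShardSnapshotState state = ShardSnapshotState.IN_PROGRESS;
    ShardSnapshot(String shardId, String node) {
        this.shardId = shardId;
        this.assignedNode = node;
    }
}

class SnapshotCoordinator {
    private final Map<String, ShardSnapshot> shards = new HashMap<>();
    // shardId -> current primary node, updated as replicas are promoted
    private final Map<String, String> currentPrimary = new HashMap<>();

    void startShardSnapshot(String shardId, String primaryNode) {
        currentPrimary.put(shardId, primaryNode);
        shards.put(shardId, new ShardSnapshot(shardId, primaryNode));
    }

    // Called when a replica is promoted after the old primary's node left.
    void onPrimaryPromoted(String shardId, String newPrimaryNode) {
        currentPrimary.put(shardId, newPrimaryNode);
    }

    // Called when a node leaves the cluster.
    void onNodeLeft(String node) {
        for (ShardSnapshot s : shards.values()) {
            if (s.state == ShardSnapshotState.IN_PROGRESS && node.equals(s.assignedNode)) {
                String newPrimary = currentPrimary.get(s.shardId);
                if (newPrimary != null && !newPrimary.equals(node)) {
                    // Proposed behavior: restart the shard snapshot on the
                    // newly promoted primary instead of failing it.
                    s.assignedNode = newPrimary;
                    s.state = ShardSnapshotState.RETRYING;
                } else {
                    // No promoted replica available: the shard snapshot fails
                    // and the overall snapshot is at best PARTIAL (today's
                    // behavior for every such shard).
                    s.state = ShardSnapshotState.FAILED;
                }
            }
        }
    }

    ShardSnapshotState stateOf(String shardId) { return shards.get(shardId).state; }
    String nodeOf(String shardId) { return shards.get(shardId).assignedNode; }
}
```

The key design point is that the coordinator consults the current routing (which replica was promoted) at the moment the node departure is observed, so a shard snapshot only fails when there is genuinely no surviving copy to retry on.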