Aborting a Snapshot Queued after a Finalizing Snapshot is Broken #75598

@original-brownbear

There is a bug in the concurrent snapshot logic: the following situation, involving three concurrent snapshots and a snapshot delete, is handled incorrectly and may lead to writing corrupted repository metadata:

  1. Start 3 snapshots for the same two indices.
  2. Abort the one in the middle after the first snapshot finishes on the data node (as far as writing to the repository goes) but before the index gets out of the queued state for the second snapshot.
  3. The third snapshot is moved to the started state once the middle snapshot completes in the FAILED state, but it has null for the shard generation for shards in the shared index.
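
The steps above can be sketched as the sequence of REST calls a reproduction attempt would issue. This is only an illustration under assumptions: the repository, snapshot, and index names are made up, and the helper `build_repro_requests` is hypothetical. It only builds the request plan (create snapshot via `PUT /_snapshot/<repo>/<snapshot>`, abort via `DELETE /_snapshot/<repo>/<snapshot>`); it does not control the timing of the abort, which is what actually triggers the bug.

```python
import json


def build_repro_requests(repo, indices, snapshots=("snap-1", "snap-2", "snap-3")):
    """Return the ordered (method, path, body) tuples for the scenario.

    All names are illustrative assumptions, not taken from the issue.
    """
    # Every snapshot targets the same indices, so they share shards.
    body = {"indices": ",".join(indices), "include_global_state": False}
    plan = []
    for name in snapshots:
        # Do not wait for completion, so the snapshots queue up concurrently.
        path = f"/_snapshot/{repo}/{name}?wait_for_completion=false"
        plan.append(("PUT", path, body))
    # Deleting an in-progress snapshot aborts it; the target is the middle one.
    plan.append(("DELETE", f"/_snapshot/{repo}/{snapshots[1]}", None))
    return plan


plan = build_repro_requests("test-repo", ["index-a", "index-b"])
for method, path, body in plan:
    print(method, path, json.dumps(body) if body else "")
```

Sending these requests back to back will usually not hit the bug, because the abort has to land in the narrow window described in step 2.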

This is a fairly unlikely scenario to run into since the abort must be timed just right, but it's somewhat more likely if the second snapshot has a larger diff with the first snapshot (so writing the files takes longer ... though even in this scenario the cluster state (CS) with the abort has to be applied on the data node right after finishing the last file).
=> fixing this asap but probably not before next week
