Skip to content

Track Shard-Snapshot Index Generation at Repository Root #46250

Merged
original-brownbear merged 182 commits intoelastic:masterfrom
original-brownbear:smarter-incrementality-snapshots
Oct 22, 2019
Merged

Track Shard-Snapshot Index Generation at Repository Root #46250
original-brownbear merged 182 commits intoelastic:masterfrom
original-brownbear:smarter-incrementality-snapshots

Conversation

@original-brownbear
Copy link
Copy Markdown
Contributor

@original-brownbear original-brownbear commented Sep 3, 2019

Changes to Root-Level index-N (RepositoryData)

This change adds a new field "shards" to RepositoryData that contains a mapping of IndexId to a String[]. This string array can be accessed by shard id to get the generation of a shard's shard folder (i.e. the N in the name of the currently valid /indices/${indexId}/${shardId}/index-${N} for the shard in question).

Benefits

This allows for creating a new snapshot in the shard without doing any LIST operations on the shard's folder. In the case of AWS S3, this saves about 1/3 of the cost for updating an empty shard (see #45736) and removes one out of two remaining potential issues with eventually consistent blob stores (see #38941 ... now only the root index-${N} is determined by listing).

Also and equally if not more important, a number of possible failure modes on eventually consistent blob stores like AWS S3 are eliminated by moving all delete operations to the master node and moving from incremental naming of shard level index-N to uuid suffixes for these blobs.

Only Master Deletes Blobs

This change moves the deleting of the previous shard level index-${uuid} blob to the master node instead of the data node allowing for a safe and consistent update of the shard's generation in the RepositoryData by first updating RepositoryData and then deleting the now unreferenced index-${newUUID} blob.
No deletes are executed on the data nodes at all for any operation with this change.

Note also: Previous issues with hanging data nodes interfering with master nodes are completely impossible, even on S3 (see next section for details).

Why Move from index-${N} to index-${uuid} at the Shard Level

This change changes the naming of the shard level index-${N} blobs to a uuid suffix index-${UUID}. The reason for this is the fact that writing a new shard-level index- generation blob is not atomic anymore in its effect. Not only does the blob have to be written to have an effect, it must also be referenced by the root level index-N (RepositoryData) to become an effective part of the snapshot repository.
This leads to a problem if we were to use incrementing names like we did before. If a blob index-${N+1} is written but due to the node/network/cluster/... crashes the root level RepositoryData has not been updated then a future operation will determine the shard's generation to be N and try to write a new index-${N+1} to the already existing path. Updates like that are problematic on S3 for consistency reasons, but also create numerous issues when thinking about stuck data nodes.
Previously stuck data nodes that were tasked to write index-${N+1} but got stuck and tried to do so after some other node had already written index-${N+1} were prevented form doing so (except for on S3) by us not allowing overwrites for that blob and thus no corruption could occur.
Were we to continue using incrementing names, we could not do this. The stuck node scenario would either allow for overwriting the N+1 generation or force us to continue using a LIST operation to figure out the next N (which would make this change pointless).
With uuid naming and moving all deletes to master this becomes a non-issue. Data nodes write updated shard generation index-${uuid} and master makes those index-${uuid} part of the RepositoryData that it deems correct and cleans up all those index- that are unused.

@original-brownbear original-brownbear added >enhancement WIP :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Sep 3, 2019
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed

@original-brownbear original-brownbear changed the title Track Shard-Snapshot Index Generationat Repository Root Track Shard-Snapshot Index Generation at Repository Root Sep 3, 2019
@original-brownbear
Copy link
Copy Markdown
Contributor Author

Thanks @ywelsch, I like that patch :) Applied in 1bdbf2c with minor change 68e407c (to make it a little more obvious what fails the overall request and what doesn't) now :)

@original-brownbear
Copy link
Copy Markdown
Contributor Author

Jenkins run elasticsearch-ci/2 (unrelated xpack security failure)

Copy link
Copy Markdown
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left one important comment that needs addressing, o.w. LGTM

@original-brownbear original-brownbear merged commit 8c13cf7 into elastic:master Oct 22, 2019
@original-brownbear original-brownbear deleted the smarter-incrementality-snapshots branch October 22, 2019 18:26
@original-brownbear
Copy link
Copy Markdown
Contributor Author

Thanks so much for grinding through this @tlrx and @ywelsch, much appreciated !

original-brownbear added a commit that referenced this pull request Oct 25, 2019
This relates to the effort towards #46250. We added
tracking of the shard generation for successful
snapshots to `8.0`.
This assertion isn't correct though. While an `8.0`
master won't create an entry with sucess state and
a null shard generation it may still (on e.g. master
failover) send a success entry created by a 7.x master
with a `null` generation over the wire.

Closes #47406
original-brownbear added a commit that referenced this pull request Jun 5, 2020
This PR introduces two new fields in to `RepositoryData` (index-N) to track the blob name of `IndexMetaData` blobs and their content via setting generations and uuids. This is used to deduplicate the `IndexMetaData` blobs (`meta-{uuid}.dat` in the indices folders under `/indices` so that new metadata for an index is only written to the repository during a snapshot if that same metadata can't be found in another snapshot.
This saves one write per index in the common case of unchanged metadata thus saving cost and making snapshot finalization drastically faster if many indices are being snapshotted at the same time.

The implementation is mostly analogous to that for shard generations in #46250 and piggy backs on the BwC mechanism introduced in that PR (which means this PR needs adjustments if it doesn't go into `7.6`). 

Relates to #45736 as it improves the efficiency of snapshotting unchanged indices
Relates to #49800 as it has the potential of loading the index metadata for multiple snapshots of the same index concurrently much more efficient speeding up future concurrent snapshot delete
mkleen added a commit to crate/crate that referenced this pull request Jun 29, 2020
mkleen added a commit to crate/crate that referenced this pull request Jun 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement v7.6.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants