
Remove Snapshot INIT Step#55918

Merged
original-brownbear merged 8 commits into elastic:master from original-brownbear:no-more-snapshot-init
May 5, 2020

Conversation

@original-brownbear (Contributor) commented Apr 29, 2020

With #55773 the snapshot INIT state step has become obsolete. We can set up the snapshot directly in one single step to simplify the state machine.

This is a big help for building concurrent snapshots because it allows us to establish a deterministic order of operations between snapshot create and delete operations since all of their entries now contain a repository generation. With this change simple queuing up of snapshot operations can and will be added in a follow-up.

@original-brownbear original-brownbear added >non-issue WIP :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v7.8.0 labels Apr 29, 2020
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team. label Apr 29, 2020
@original-brownbear (Contributor Author)

Jenkins run elasticsearch-ci/bwc

logger.warn("Failed to load snapshot metadata, assuming repository is in old format", e);
return OLD_SNAPSHOT_FORMAT;
}
return OLD_SNAPSHOT_FORMAT;
Contributor Author

7.6 always adds the versions to the repository data. If we don't find one here then there's no point in doing any IO because we know that the repository data was written by a version older than 7.6. We can simply go ahead and assume an old version and be done with things. I had to make this change here because I needed this method to work on the CS thread, but we can make it safely in all versions in fact. Loading the individual SnapshotInfo here was never necessary.
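
The fallback described above can be sketched as follows. This is an illustrative, simplified model (the class, method, and parameter names are hypothetical, not the actual Elasticsearch API): because repositories written by 7.6 and later always record a per-snapshot version in the repository data, a missing entry alone proves the repository is in the old format, so no blob-store IO is needed and the check is safe on the cluster state thread.

```java
import java.util.Map;

// Hypothetical sketch of the version lookup: a missing version in the
// repository data means the writer predates 7.6, so we can return the
// old-format marker without loading any SnapshotInfo from the repository.
public class MinVersionSketch {

    static final String OLD_SNAPSHOT_FORMAT = "pre-7.6";

    // versions: snapshot id -> version recorded in the repository data
    static String minCompatibleVersion(Map<String, String> versions, String snapshotId) {
        String version = versions.get(snapshotId);
        if (version == null) {
            // No version recorded: the repository data was written before 7.6.
            // Assume the old format without doing any IO.
            return OLD_SNAPSHOT_FORMAT;
        }
        return version;
    }

    public static void main(String[] args) {
        System.out.println(minCompatibleVersion(Map.of("snap-1", "7.7.0"), "snap-1")); // 7.7.0
        System.out.println(minCompatibleVersion(Map.of(), "snap-1"));                  // pre-7.6
    }
}
```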

}
testClusterNodes.randomDataNodeSafe().client.admin().cluster().prepareCreateSnapshot(repoName, snapshotName)
.execute(snapshotStartedListener);
.execute(ActionListener.wrap(() -> {
Contributor Author

This is basically a revert of 93c6d77

Master fail-over would not lead to a failing snapshot response here with any observable frequency when we had the INIT state.
If the client (connected to a data node) lost its connection to the master node (because we temporarily disconnect the master in some runs), it would retry, and meanwhile the master would at most have made it to an INIT state in the CS (which the retry would simply ignore once a new master is elected). Now the first CS update will always be a STARTED state snapshot, and a retry can again run into an in-progress snapshot.
We should fix clean retrying in some other way (we could do what we did for force-merge UUIDs and generate the unique SnapshotId in the transport request already, so that retries can know they're a retry). Since we haven't released 7.8 and this is a really minor UX win to revert here, I think this is an acceptable "regression".
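
The retry idea mentioned above could look roughly like this. This is a hedged sketch, not the Elasticsearch implementation (class and method names are made up): the unique snapshot UUID is generated on the request side, so a retried request carries the same UUID and the master can recognize it as a retry of an already-started snapshot rather than a conflicting new one.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Illustrative model of idempotent snapshot-start via a client-generated UUID.
public class SnapshotRetrySketch {

    // in-progress snapshots: name -> UUID of the request that started them
    final Map<String, String> inProgress = new HashMap<>();

    /** Returns true if the snapshot started, or if this is a retry of the same request. */
    boolean startSnapshot(String name, String requestUuid) {
        String existing = inProgress.putIfAbsent(name, requestUuid);
        // null: newly started; equal UUID: retry of our own request
        return existing == null || existing.equals(requestUuid);
    }

    public static void main(String[] args) {
        SnapshotRetrySketch master = new SnapshotRetrySketch();
        // UUID generated in the transport request, not on the master
        String uuid = UUID.randomUUID().toString();
        System.out.println(master.startSnapshot("snap-1", uuid)); // true: started
        System.out.println(master.startSnapshot("snap-1", uuid)); // true: recognized retry
        System.out.println(master.startSnapshot("snap-1", UUID.randomUUID().toString())); // false: conflict
    }
}
```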

.build();
}

public void testDisruptionOnSnapshotInitialization() throws Exception {
Contributor Author

Obsolete :)

null, userMeta, version);
}
return ClusterState.builder(currentState).putCustom(SnapshotsInProgress.TYPE,
new SnapshotsInProgress(List.of(newEntry))).build();
Contributor Author

Simplified this code a bit: we no longer have to find the INIT entry in the existing state, and I felt there was no point in pretending we could have more than one snapshot here when we don't support any such thing yet. Concurrent snapshot operations will need bigger changes to this step anyway, so the pretend loop is of no use yet.
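
A simplified model of the cluster-state update described above (illustrative record and method names, not the real Elasticsearch types): with the INIT step gone there is no existing entry to locate and replace, so the executor builds SnapshotsInProgress from the single new STARTED entry directly.

```java
import java.util.List;

// Hypothetical model: one-step creation of a STARTED snapshot entry that
// already carries the repository generation it was built against.
public class SingleEntrySketch {

    record Entry(String snapshot, String state, long repositoryGeneration) {}
    record SnapshotsInProgress(List<Entry> entries) {}

    static SnapshotsInProgress startSnapshot(String snapshot, long repoGen) {
        // No INIT placeholder and no loop over pre-existing entries:
        // the new entry is created directly in STARTED state.
        return new SnapshotsInProgress(List.of(new Entry(snapshot, "STARTED", repoGen)));
    }

    public static void main(String[] args) {
        SnapshotsInProgress inProgress = startSnapshot("snap-1", 42L);
        System.out.println(inProgress.entries().size());          // 1
        System.out.println(inProgress.entries().get(0).state());  // STARTED
    }
}
```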

}
failureMessage.append("Indices are closed ");
}
// TODO: We should just throw here instead of creating a FAILED and hence useless snapshot in the repository
Contributor Author

I didn't want to make this change here yet, but I think it's a valid change. It's entirely pointless to keep writing these snapshots to the repo. It's against the spirit of what partial == false means IMO and it adds nothing but an unused cluster state and a bunch of unused index metadata to the repo for no good reason.

Member

As discussed this morning, we may want to validate this point with Cloud since it is useful to know which snapshots have failed. Beyond that, I can't remember any valid reason to write a FAILED snapshot into the repository if the snapshot did not even start.

result.v1().getGenId(), null, Priority.IMMEDIATE, listener));
},
e -> {
if (abortedDuringInit) {
Contributor Author

No need to wait here anymore. If the snapshot was aborted during INIT, then that means it was already simply removed from the CS so no point in waiting here. This only made sense when we had the INIT state and would move INIT -> ABORT -> remove via https://github.com/elastic/elasticsearch/pull/55918/files#diff-a0853be4492c052f24917b5c1464003dL424 (or the apply CS method) which isn't a thing any longer without the INIT state.

});
}

private static class CleanupAfterErrorListener {
Contributor Author

This was only used to clean up INIT state snapshots.

@original-brownbear original-brownbear marked this pull request as ready for review April 29, 2020 12:52
@tlrx (Member) left a comment

LGTM

Thanks for the extra comments, which helped a lot. I think you nailed the corner cases; at least I can't find one that isn't already handled in some way.

@original-brownbear (Contributor Author)

Jenkins run elasticsearch-ci/2

@original-brownbear (Contributor Author)

Thanks Tanguy!

@original-brownbear original-brownbear merged commit f4022c0 into elastic:master May 5, 2020
@original-brownbear original-brownbear deleted the no-more-snapshot-init branch May 5, 2020 16:06
@mfussenegger mfussenegger mentioned this pull request May 13, 2020
@original-brownbear (Contributor Author)

Backporting this to 7.x is extremely involved (because 7.x must still be able to go through the INIT step as long as there's a pre-7.5 version node in the cluster, and correctly deal with all the resulting corner cases throughout the codebase). Just making a note here that this backport (and other snapshot PRs that depend on it) hasn't been forgotten, but I'd like to raise this in the snapshot resiliency meeting to discuss strategy for this kind of BwC problem before deciding on a technical solution for the backport.

original-brownbear added a commit that referenced this pull request Jul 13, 2020
@original-brownbear original-brownbear restored the no-more-snapshot-init branch August 6, 2020 19:08
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Mar 6, 2026
Since elastic#55918, snapshot creation no longer has the INIT step so that
shards won't be finalized unless its state is completed. This PR removes
the obsolete branch for it.

See also elastic#143024 (comment)
elasticsearchmachine pushed a commit that referenced this pull request Mar 10, 2026
Since #55918, snapshot creation no longer has the INIT step so that
shards won't be finalized unless its state is completed. This PR removes
the obsolete branch for it.

See also
#143024 (comment)

Labels

:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >non-issue Team:Distributed Meta label for distributed team. v7.9.0 v8.0.0-alpha1
