Remove Snapshot INIT Step#55918
Remove Snapshot INIT Step#55918original-brownbear merged 8 commits intoelastic:masterfrom original-brownbear:no-more-snapshot-init
Conversation
|
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore) |
|
Jenkins run elasticsearch-ci/bwc |
| logger.warn("Failed to load snapshot metadata, assuming repository is in old format", e); | ||
| return OLD_SNAPSHOT_FORMAT; | ||
| } | ||
| return OLD_SNAPSHOT_FORMAT; |
There was a problem hiding this comment.
7.6 always adds the versions to the repository data. If we don't find one here then there's no point in doing any IO because we know that the repository data was written by a version older than 7.6. We can simply go ahead and assume an old version and be done with things. I had to make this change here because I needed this method to work on the CS thread, but we can make it safely in all versions in fact. Loading the individual SnapshotInfo here was never necessary.
| } | ||
| testClusterNodes.randomDataNodeSafe().client.admin().cluster().prepareCreateSnapshot(repoName, snapshotName) | ||
| .execute(snapshotStartedListener); | ||
| .execute(ActionListener.wrap(() -> { |
There was a problem hiding this comment.
This is basically a revert of 93c6d77
Master fail-over would not lead to a failing snapshot response here with any observable frequency when we had the INIT state.
If the client (connected to a data node) lost the connection to the master node (because we disconnect the master temporarily in some runs), it would retry and the master meanwhile would have at the most made it to an INIT state in the CS (which the retry would simply ignore once a new master is elected). Now the first CS update will always be a STARTED state snapshot and a retry can again run into an in-progress snapshot.
We should fix clean retrying some other way (could do what we did for force merge UUIDs and generate the unique SnapshotId in the transport request already so that retries can know they're a retry). Since we haven't released 7.8 and this is a really minor UX win to revert here, I think this is an acceptable "regression".
| .build(); | ||
| } | ||
|
|
||
| public void testDisruptionOnSnapshotInitialization() throws Exception { |
There was a problem hiding this comment.
Obsolete :)
| null, userMeta, version); | ||
| } | ||
| return ClusterState.builder(currentState).putCustom(SnapshotsInProgress.TYPE, | ||
| new SnapshotsInProgress(List.of(newEntry))).build(); |
There was a problem hiding this comment.
Simplified this code a bit since we don't have to find the INIT entry in the existing state any longer and I felt like there was no point in pretending we would have more than one snapshot here yet when we don't have any such thing. Concurrent snapshot operations will need some bigger changes to this step anyway so the pretend loop is of no use yet.
| } | ||
| failureMessage.append("Indices are closed "); | ||
| } | ||
| // TODO: We should just throw here instead of creating a FAILED and hence useless snapshot in the repository |
There was a problem hiding this comment.
I didn't want to make this change here yet, but I think it's a valid change. It's entirely pointless to keep writing these snapshots to the repo. It's against the spirit of what partial == false means IMO and it adds nothing but an unused cluster state and a bunch of unused index metadata to the repo for no good reason.
There was a problem hiding this comment.
As discussed this morning we may want to validate this point with Cloud as it is useful to know which snapshot have failed. Besides this I can't remember any valid reason to write FAILED snapshot into the repository if the snapshot did not even start.
| result.v1().getGenId(), null, Priority.IMMEDIATE, listener)); | ||
| }, | ||
| e -> { | ||
| if (abortedDuringInit) { |
There was a problem hiding this comment.
No need to wait here anymore. If the snapshot was aborted during INIT, then that means it was already simply removed from the CS so no point in waiting here. This only made sense when we had the INIT state and would move INIT -> ABORT -> remove via https://github.com/elastic/elasticsearch/pull/55918/files#diff-a0853be4492c052f24917b5c1464003dL424 (or the apply CS method) which isn't a thing any longer without the INIT state.
| }); | ||
| } | ||
|
|
||
| private static class CleanupAfterErrorListener { |
There was a problem hiding this comment.
This was only used to clean up INIT state snapshots
tlrx
left a comment
There was a problem hiding this comment.
LGTM
Thanks for the extra comments which helped a lot. I think you nailed the corner cases, at least I can't find one which was already handled in some way.
| } | ||
| failureMessage.append("Indices are closed "); | ||
| } | ||
| // TODO: We should just throw here instead of creating a FAILED and hence useless snapshot in the repository |
There was a problem hiding this comment.
As discussed this morning we may want to validate this point with Cloud as it is useful to know which snapshot have failed. Besides this I can't remember any valid reason to write FAILED snapshot into the repository if the snapshot did not even start.
|
Jenkins run elasticsearch-ci/2 |
|
Thanks Tanguy! |
|
Backporting this to |
With #55773 the snapshot INIT state step has become obsolete. We can set up the snapshot directly in one single step to simplify the state machine. This is a big help for building concurrent snapshots because it allows us to establish a deterministic order of operations between snapshot create and delete operations since all of their entries now contain a repository generation. With this change simple queuing up of snapshot operations can and will be added in a follow-up.
Since elastic#55918, snapshot creation no longer has the INIT step so that shards won't be finalized unless its state is completed. This PR removes the obsolete branch for it. See also elastic#143024 (comment)
Since #55918, snapshot creation no longer has the INIT step so that shards won't be finalized unless its state is completed. This PR removes the obsolete branch for it. See also #143024 (comment)
With #55773 the snapshot INIT state step has become obsolete. We can set up the snapshot directly in one single step to simplify the state machine.
This is a big help for building concurrent snapshots because it allows us to establish a deterministic order of operations between snapshot create and delete operations since all of their entries now contain a repository generation. With this change simple queuing up of snapshot operations can and will be added in a follow-up.