
Remove Snapshot INIT Step#55918

Merged
original-brownbear merged 8 commits into elastic:master from original-brownbear:no-more-snapshot-init
May 5, 2020

Conversation

@original-brownbear (Contributor) commented Apr 29, 2020

With #55773 the snapshot INIT state step has become obsolete. We can set up the snapshot directly in one single step to simplify the state machine.

This is a big help for building concurrent snapshots because it allows us to establish a deterministic order of operations between snapshot create and delete operations since all of their entries now contain a repository generation. With this change simple queuing up of snapshot operations can and will be added in a follow-up.

@original-brownbear original-brownbear added >non-issue WIP :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v7.8.0 labels Apr 29, 2020
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team. label Apr 29, 2020
@original-brownbear (Contributor Author)

Jenkins run elasticsearch-ci/bwc

logger.warn("Failed to load snapshot metadata, assuming repository is in old format", e);
return OLD_SNAPSHOT_FORMAT;
}
return OLD_SNAPSHOT_FORMAT;
Contributor Author

7.6 always adds the versions to the repository data. If we don't find one here then there's no point in doing any IO because we know that the repository data was written by a version older than 7.6. We can simply go ahead and assume an old version and be done with things. I had to make this change here because I needed this method to work on the CS thread, but we can make it safely in all versions in fact. Loading the individual SnapshotInfo here was never necessary.
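
The fallback described above can be sketched as follows. This is an illustrative, simplified model (the class, method, and parameter names are hypothetical, not the actual Elasticsearch API): because repositories written by 7.6 and later always record a per-snapshot version in the repository data, a missing entry alone proves the repository is in the old format, so no blob-store IO is needed and the check is safe on the cluster state thread.

```java
import java.util.Map;

// Hypothetical sketch of the version lookup: a missing version in the
// repository data means the writer predates 7.6, so we can return the
// old-format marker without loading any SnapshotInfo from the repository.
public class MinVersionSketch {

    static final String OLD_SNAPSHOT_FORMAT = "pre-7.6";

    // versions: snapshot id -> version recorded in the repository data
    static String minCompatibleVersion(Map<String, String> versions, String snapshotId) {
        String version = versions.get(snapshotId);
        if (version == null) {
            // No version recorded: the repository data was written before 7.6.
            // Assume the old format without doing any IO.
            return OLD_SNAPSHOT_FORMAT;
        }
        return version;
    }

    public static void main(String[] args) {
        System.out.println(minCompatibleVersion(Map.of("snap-1", "7.7.0"), "snap-1")); // 7.7.0
        System.out.println(minCompatibleVersion(Map.of(), "snap-1"));                  // pre-7.6
    }
}
```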

}
testClusterNodes.randomDataNodeSafe().client.admin().cluster().prepareCreateSnapshot(repoName, snapshotName)
.execute(snapshotStartedListener);
.execute(ActionListener.wrap(() -> {
Contributor Author

This is basically a revert of 93c6d77

Master fail-over would not lead to a failing snapshot response here with any observable frequency when we had the INIT state.
If the client (connected to a data node) lost its connection to the master node (because we temporarily disconnect the master in some runs), it would retry, and meanwhile the master would at most have made it to an INIT state in the CS (which the retry would simply ignore once a new master is elected). Now the first CS update will always be a STARTED state snapshot, and a retry can again run into an in-progress snapshot.
We should fix clean retrying in some other way (we could do what we did for force-merge UUIDs and generate the unique SnapshotId in the transport request already, so that retries can know they're a retry). Since we haven't released 7.8 and this is a really minor UX win to revert here, I think this is an acceptable "regression".
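
The retry idea mentioned above could look roughly like this. This is a hedged sketch, not the Elasticsearch implementation (class and method names are made up): the unique snapshot UUID is generated on the request side, so a retried request carries the same UUID and the master can recognize it as a retry of an already-started snapshot rather than a conflicting new one.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Illustrative model of idempotent snapshot-start via a client-generated UUID.
public class SnapshotRetrySketch {

    // in-progress snapshots: name -> UUID of the request that started them
    final Map<String, String> inProgress = new HashMap<>();

    /** Returns true if the snapshot started, or if this is a retry of the same request. */
    boolean startSnapshot(String name, String requestUuid) {
        String existing = inProgress.putIfAbsent(name, requestUuid);
        // null: newly started; equal UUID: retry of our own request
        return existing == null || existing.equals(requestUuid);
    }

    public static void main(String[] args) {
        SnapshotRetrySketch master = new SnapshotRetrySketch();
        // UUID generated in the transport request, not on the master
        String uuid = UUID.randomUUID().toString();
        System.out.println(master.startSnapshot("snap-1", uuid)); // true: started
        System.out.println(master.startSnapshot("snap-1", uuid)); // true: recognized retry
        System.out.println(master.startSnapshot("snap-1", UUID.randomUUID().toString())); // false: conflict
    }
}
```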

.build();
}

public void testDisruptionOnSnapshotInitialization() throws Exception {
Contributor Author

Obsolete :)

null, userMeta, version);
}
return ClusterState.builder(currentState).putCustom(SnapshotsInProgress.TYPE,
new SnapshotsInProgress(List.of(newEntry))).build();
Contributor Author

Simplified this code a bit: we no longer have to find the INIT entry in the existing state, and I felt there was no point in pretending we could have more than one snapshot here when we don't support any such thing yet. Concurrent snapshot operations will need bigger changes to this step anyway, so the pretend loop is of no use yet.
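
A simplified model of the cluster-state update described above (illustrative record and method names, not the real Elasticsearch types): with the INIT step gone there is no existing entry to locate and replace, so the executor builds SnapshotsInProgress from the single new STARTED entry directly.

```java
import java.util.List;

// Hypothetical model: one-step creation of a STARTED snapshot entry that
// already carries the repository generation it was built against.
public class SingleEntrySketch {

    record Entry(String snapshot, String state, long repositoryGeneration) {}
    record SnapshotsInProgress(List<Entry> entries) {}

    static SnapshotsInProgress startSnapshot(String snapshot, long repoGen) {
        // No INIT placeholder and no loop over pre-existing entries:
        // the new entry is created directly in STARTED state.
        return new SnapshotsInProgress(List.of(new Entry(snapshot, "STARTED", repoGen)));
    }

    public static void main(String[] args) {
        SnapshotsInProgress inProgress = startSnapshot("snap-1", 42L);
        System.out.println(inProgress.entries().size());          // 1
        System.out.println(inProgress.entries().get(0).state());  // STARTED
    }
}
```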

}
failureMessage.append("Indices are closed ");
}
// TODO: We should just throw here instead of creating a FAILED and hence useless snapshot in the repository
Contributor Author

I didn't want to make this change here yet, but I think it's a valid change. It's entirely pointless to keep writing these snapshots to the repo. It's against the spirit of what partial == false means IMO and it adds nothing but an unused cluster state and a bunch of unused index metadata to the repo for no good reason.

Member

As discussed this morning, we may want to validate this point with Cloud since it is useful to know which snapshots have failed. Beyond that, I can't remember any valid reason to write a FAILED snapshot into the repository if the snapshot did not even start.

result.v1().getGenId(), null, Priority.IMMEDIATE, listener));
},
e -> {
if (abortedDuringInit) {
Contributor Author

No need to wait here anymore. If the snapshot was aborted during INIT, then that means it was already simply removed from the CS so no point in waiting here. This only made sense when we had the INIT state and would move INIT -> ABORT -> remove via https://github.com/elastic/elasticsearch/pull/55918/files#diff-a0853be4492c052f24917b5c1464003dL424 (or the apply CS method) which isn't a thing any longer without the INIT state.

});
}

private static class CleanupAfterErrorListener {
Contributor Author

This was only used to clean up INIT state snapshots.

@original-brownbear original-brownbear marked this pull request as ready for review April 29, 2020 12:52
@tlrx (Member) left a comment

LGTM

Thanks for the extra comments, which helped a lot. I think you nailed the corner cases; at least I can't find one that isn't already handled in some way.

@original-brownbear (Contributor Author)

Jenkins run elasticsearch-ci/2

@original-brownbear (Contributor Author)

Thanks Tanguy!

@original-brownbear original-brownbear merged commit f4022c0 into elastic:master May 5, 2020
@original-brownbear original-brownbear deleted the no-more-snapshot-init branch May 5, 2020 16:06
@mfussenegger mfussenegger mentioned this pull request May 13, 2020
@original-brownbear (Contributor Author)

Backporting this to 7.x is extremely involved (because 7.x must still be able to go through the INIT step as long as there's a pre-7.5 version node in the cluster, and correctly deal with all the resulting corner cases throughout the codebase). Just making a note here that this backport (and other snapshot PRs that depend on it) hasn't been forgotten, but I'd like to raise this in the snapshot resiliency meeting to discuss strategy for this kind of BwC problem before deciding on a technical solution for the backport.

original-brownbear added a commit that referenced this pull request Jul 13, 2020
@original-brownbear original-brownbear restored the no-more-snapshot-init branch August 6, 2020 19:08
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Mar 6, 2026
Since elastic#55918, snapshot creation no longer has the INIT step so that
shards won't be finalized unless its state is completed. This PR removes
the obsolete branch for it.

See also elastic#143024 (comment)
elasticsearchmachine pushed a commit that referenced this pull request Mar 10, 2026
Since #55918, snapshot creation no longer has the INIT step so that
shards won't be finalized unless its state is completed. This PR removes
the obsolete branch for it.

See also
#143024 (comment)

Labels

:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >non-issue Team:Distributed Meta label for distributed team. v7.9.0 v8.0.0-alpha1
