Fix BwC Snapshot INIT Path #60006

Merged
original-brownbear merged 2 commits into elastic:7.x from original-brownbear:59986
Jul 22, 2020

@original-brownbear
Contributor

There were two subtle bugs here from backporting #56911 to 7.x.

  1. We passed `null` for the `shards` map when creating `SnapshotsInProgress.Entry`,
    but that map is no longer nullable. Fixed by passing an empty map, which is
    what the old `null` handling did.
  2. Removing a failed `INIT` state snapshot from the cluster state also tried to
    remove it from the finalization loop (the set of repository names that are
    currently finalizing). This trips an assertion because the snapshot failed
    before its repository was put into the set. The logic now ignores the set
    when removing a failed `INIT` state snapshot, which restores the behavior to
    exactly what it was before the concurrent snapshots backport, to be on the
    safe side.
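
The two fixes can be illustrated with a minimal, self-contained sketch. This is not the Elasticsearch code: `BwcInitPathSketch`, its `Entry` class, `newInitEntry`, and `removeFailedSnapshot` are hypothetical stand-ins for `SnapshotsInProgress.Entry` and the finalization set described above, and the shapes of the real classes are assumed rather than copied.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-ins; names and shapes are assumptions, not the real Elasticsearch classes.
final class BwcInitPathSketch {

    enum State { INIT, STARTED }

    // Models an in-progress snapshot entry whose shards map is no longer nullable.
    static final class Entry {
        final String repository;
        final State state;
        final Map<String, String> shards;

        Entry(String repository, State state, Map<String, String> shards) {
            this.repository = repository;
            this.state = state;
            // Passing null fails fast here, which mirrors bug 1.
            this.shards = Objects.requireNonNull(shards, "shards must not be null");
        }
    }

    // Fix 1: an INIT entry has no shard assignments yet, so pass an empty map instead of null.
    static Entry newInitEntry(String repository) {
        return new Entry(repository, State.INIT, Collections.emptyMap());
    }

    // Fix 2: a snapshot that failed in INIT never reached finalization, so the set of
    // finalizing repositories is ignored for it. Returns true if the set was modified.
    static boolean removeFailedSnapshot(Entry entry, Set<String> currentlyFinalizing) {
        if (entry.state == State.INIT) {
            return false;
        }
        boolean removed = currentlyFinalizing.remove(entry.repository);
        if (removed == false) {
            // Stands in for the assertion that tripped in bug 2.
            throw new IllegalStateException("repository was not finalizing: " + entry.repository);
        }
        return true;
    }
}
```

In this sketch, calling `removeFailedSnapshot` with an `INIT` entry and an empty set returns `false` without throwing, while constructing an `Entry` with a `null` map throws a `NullPointerException`.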

Also added tests that explicitly call the old code paths. As the initial miss
shows, the BwC tests only occasionally run in the configuration of a new version
master with old version nodes, so a deterministic test for the old state machine
seems the safest bet here.

Closes #59986

Marking as non-issue since this was never released, but as blocker because it completely breaks snapshots in mixed version clusters where a new version master node runs alongside pre-7.5 nodes.

Contributor

@ywelsch ywelsch left a comment

LGTM

    private void removeFailedSnapshotFromClusterState(Snapshot snapshot, Exception failure, @Nullable RepositoryData repositoryData,
                                                      @Nullable CleanupAfterErrorListener listener) {
        assert failure != null : "Failure must be supplied";
        assert (listener == null || repositoryData == null) && (repositoryData != null || listener != null) :

Maybe reformulate this as `(listener == null && repositoryData == null) == false`?
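
For reference, the two conditions are not strictly equivalent: the original holds only when exactly one of the two arguments is non-null, while the suggested form only rules out both being null. The sketch below enumerates the four cases. `NullCheckComparison` is a hypothetical helper, with plain `Object` parameters standing in for `CleanupAfterErrorListener` and `RepositoryData`.

```java
// Hypothetical helper for comparing the two assertion conditions; not part of the PR.
final class NullCheckComparison {

    // Condition as written in the reviewed assertion: exactly one argument is non-null.
    static boolean original(Object listener, Object repositoryData) {
        return (listener == null || repositoryData == null) && (repositoryData != null || listener != null);
    }

    // Condition as suggested in the review comment: at least one argument is non-null.
    static boolean suggested(Object listener, Object repositoryData) {
        return (listener == null && repositoryData == null) == false;
    }

    public static void main(String[] args) {
        Object present = new Object();
        Object[][] cases = { { null, null }, { present, null }, { null, present }, { present, present } };
        for (Object[] c : cases) {
            System.out.println("listener set: " + (c[0] != null)
                + ", repositoryData set: " + (c[1] != null)
                + ", original: " + original(c[0], c[1])
                + ", suggested: " + suggested(c[0], c[1]));
        }
    }
}
```

The two forms differ only in the case where both arguments are non-null, so whether the rewrite is safe depends on whether any caller passes both.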

Member

@tlrx tlrx left a comment

LGTM - Glad you fixed this 👍

@original-brownbear
Contributor Author

Thanks Yannick + Tanguy!

@original-brownbear original-brownbear merged commit c06c9fb into elastic:7.x Jul 22, 2020
@original-brownbear original-brownbear deleted the 59986 branch July 22, 2020 08:09
original-brownbear added a commit that referenced this pull request Jul 22, 2020
@original-brownbear original-brownbear restored the 59986 branch August 6, 2020 18:31

Labels

blocker, :Distributed/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs), >non-issue, v7.9.0, v7.10.0
