
Fix Concurrent Snapshot Repository Corruption from Operations Queued after Failing Operations #75733

Merged
original-brownbear merged 3 commits into elastic:master from original-brownbear:fix-failed-shard-snapshot-queuing
Jul 27, 2021


Conversation

@original-brownbear
Contributor

The node executing a shard level operation would in many cases communicate null for the shard state update,
leading to follow-up operations incorrectly assuming an empty shard snapshot directory and starting from scratch.

This change fixes the logic on the side of the node executing the shard level operation and reproduces the issue in one case. The exact case in the linked issue is unfortunately very hard to reproduce because it requires very specific timing that our test infra does not currently enable easily. This should also be fixed in the master node logic (to make the code more obviously correct and to better handle mixed-version clusters), but I'd like to delay that and do it in #75501 because of how tricky it would be (lots of confusing code for clones, snapshots, and so on) to figure out the correct generation from the cluster state without that refactoring.

closes #75598
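The failure mode described above can be illustrated with a minimal sketch. This is a hypothetical, heavily simplified model for illustration only (the class and method names are invented, not the actual Elasticsearch `ShardSnapshotStatus` code): the point is that a failed shard status must carry forward the generation written by the previous operation instead of `null`, otherwise a queued follow-up operation assumes an empty shard snapshot directory.

```java
// Hypothetical toy model of the bug, not the real Elasticsearch classes.
public class ShardGenerationExample {
    // Simplified stand-in for a shard snapshot status with its repository generation.
    record ShardStatus(String state, String generation) {}

    // Buggy behavior: the previous generation is dropped on failure,
    // so a queued follow-up operation starts from scratch.
    static ShardStatus failDroppingGeneration(ShardStatus previous) {
        return new ShardStatus("FAILED", null);
    }

    // Fixed behavior: the previous shard generation is preserved,
    // so a queued follow-up operation reuses the existing shard files.
    static ShardStatus failKeepingGeneration(ShardStatus previous) {
        return new ShardStatus("FAILED", previous.generation());
    }

    public static void main(String[] args) {
        ShardStatus before = new ShardStatus("SUCCESS", "gen-7");
        System.out.println(failDroppingGeneration(before).generation()); // prints "null"
        System.out.println(failKeepingGeneration(before).generation());  // prints "gen-7"
    }
}
```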

…after Failing Operations

The node executing a shard level operation would in many cases communicate `null` for the shard state update,
leading to follow-up operations incorrectly assuming an empty shard snapshot directory and starting from scratch.

closes #75598
@original-brownbear original-brownbear added >bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v7.14.1 v7.15.0 labels Jul 27, 2021
@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team. label Jul 27, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

AbstractSnapshotIntegTestCase.<MockRepository>getRepositoryOnNode(repository, nodeName).blockOnDataFiles();
}

public static void blockAndFailDataNode(String repository, String nodeName) {
Contributor Author


The fact that we didn't have this logic revealed an unfortunate lack of test coverage. We have a number of tests that simulate data-node failure, but they're all based on blocking the data node via the existing block-and-wait and then shutting down the blocked data nodes, which triggers a very different code path on master.

Member

@DaveCTurner DaveCTurner left a comment


Nice catch. I asked for some slightly tighter assertions if possible.

localNodeId,
ShardState.FAILED,
"failed to clone shard snapshot",
shardStatusBefore.generation()
Member


Can we shift some of the @Nullable annotations and associated assertions around in ShardSnapshotStatus? With this change I think we might still receive a null generation over the wire from an older version, but we shouldn't be creating them afresh any more?

Member


In fact we ought to change the wire format so it's no longer an OptionalString. I'm ok with not doing that here since it'll make the backport that much harder; a follow-up is fine.

Contributor Author


We would in fact still create them for pre-7.6 state machine operation (still a thing if there's an old snapshot in your repo), and we don't have access to the snapshot version in these constructors. In these cases, null means "figure out the numeric generation yourself", which is what would happen to a queued operation if e.g. the first operation for a shard in the cluster state fails.

Let me see what I can do about this though :)
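The BwC constraint mentioned above can be sketched as follows. This is a hypothetical illustration only (the method and variable names are invented, not the real repository code): for pre-7.6 style snapshots a null generation acts as a sentinel meaning "derive the numeric generation from the shard directory yourself", which is why the constructors cannot unconditionally assert non-null generations.

```java
import java.util.List;

// Hypothetical sketch of the legacy null-generation fallback, not the real code.
public class LegacyGenerationResolution {
    // With a null generation, fall back to the highest index-N blob found in
    // the shard directory (here simplified to a list of numeric generations).
    static long resolveGeneration(String generation, List<Long> indexNBlobs) {
        if (generation == null) {
            return indexNBlobs.stream().mapToLong(Long::longValue).max().orElse(-1L);
        }
        return Long.parseLong(generation); // legacy generations are numeric
    }

    public static void main(String[] args) {
        System.out.println(resolveGeneration(null, List.of(3L, 7L, 5L))); // prints "7"
        System.out.println(resolveGeneration("4", List.of()));            // prints "4"
    }
}
```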

Contributor Author

@original-brownbear original-brownbear Jul 27, 2021


Bad news here I'm afraid. I had to remove the assertion since it didn't hold up, and it's also really hard to assert this elsewhere due to the BwC situation (we can't neatly do this in ShardSnapshotStatus without refactoring its construction, and doing it elsewhere is tricky as well since it's so many places right now).
If it's ok with you, I'd rather look for a cleaner way of asserting this once #75501 has landed (or actually as part of incorporating this into that change) and just assert that we no longer make illegal changes to SnapshotsInProgress where a non-null generation becomes a null generation for a given shard (much easier if we don't have to hack around translating ShardId and RepoShardId all over the place)?

Member


Sure, thanks for looking. Blasted BwC, always spoiling the fun.

Member

@DaveCTurner DaveCTurner left a comment


LGTM

@original-brownbear
Contributor Author

Thanks David!

@original-brownbear original-brownbear merged commit f1ba7c4 into elastic:master Jul 27, 2021
@original-brownbear original-brownbear deleted the fix-failed-shard-snapshot-queuing branch July 27, 2021 14:13
ywangd pushed a commit to ywangd/elasticsearch that referenced this pull request Jul 30, 2021
…after Failing Operations (elastic#75733)

The node executing a shard level operation would in many cases communicate `null` for the shard state update,
leading to follow-up operations incorrectly assuming an empty shard snapshot directory and starting from scratch.

closes elastic#75598
original-brownbear added a commit that referenced this pull request Aug 16, 2021
…after Failing Operations (#75733) (#76548)

The node executing a shard level operation would in many cases communicate `null` for the shard state update,
leading to follow-up operations incorrectly assuming an empty shard snapshot directory and starting from scratch.

closes #75598
original-brownbear added a commit that referenced this pull request Aug 16, 2021
…after Failing Operations (#75733) (#76548) (#76556)

The node executing a shard level operation would in many cases communicate `null` for the shard state update,
leading to follow-up operations incorrectly assuming an empty shard snapshot directory and starting from scratch.

closes #75598
@original-brownbear original-brownbear restored the fix-failed-shard-snapshot-queuing branch April 18, 2023 21:01

Labels

>bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Meta label for distributed team. v7.14.1 v7.15.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Aborting a Snapshot Queued after a Finalizing Snapshot is Broken

6 participants