
Fix TODO about Spurious FAILED Snapshots#58994

Merged
original-brownbear merged 14 commits into elastic:master from original-brownbear:remove-snapshot-spurious-failed
Jul 8, 2020

Conversation

@original-brownbear
Contributor

There is no point in writing out snapshots that contain no data that can be restored
whatsoever. It may have made sense to do so in the past, when there was an `INIT` snapshot
step that wrote data to the repository that would've otherwise become unreferenced, but in the
current-day state machine without the `INIT` step there is no point in doing so.
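As a rough illustration of the changed behavior (a hypothetical sketch with made-up names, not the actual Elasticsearch `SnapshotsService` code): a snapshot whose shards all failed contains nothing restorable, so the create request can be rejected up front instead of persisting a `FAILED` snapshot to the repository.

```java
import java.util.List;

// Hypothetical sketch of the decision, not the real Elasticsearch logic.
class SnapshotWriteDecision {
    enum ShardState { SUCCESS, FAILED }

    // Only write snapshot metadata to the repository if at least one shard
    // actually produced restorable data; an all-FAILED snapshot is useless.
    static boolean shouldWriteSnapshot(List<ShardState> shards) {
        return shards.stream().anyMatch(s -> s == ShardState.SUCCESS);
    }
}
```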
@original-brownbear original-brownbear added >non-issue :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v7.9.0 labels Jul 3, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team. label Jul 3, 2020
@original-brownbear
Contributor Author

@tlrx I know we discussed this before and there was some worry about Cloud here. I think that's a non-issue since Cloud is using partial snapshots anyway ever since 7.6, so this change doesn't affect them :)

Contributor

@ywelsch ywelsch left a comment

I've left a few comments, generally looking good though.

State.FAILED, indexIds, dataStreams, threadPool.absoluteTimeInMillis(), repositoryData.getGenId(), shards,
"Indices don't have primary shards " + missing, userMeta, version);
throw new SnapshotException(
new Snapshot(repositoryName, snapshotId),"Indices don't have primary shards " + missing);
Contributor

space missing.

assertEquals(SnapshotState.SUCCESS, getSnapshotsResponse.getSnapshots("test-repo-2").get(0).state());
}

public void testSnapshotStatusOnFailedIndex() throws Exception {
Contributor

While this test used the old behavior to get a failed snapshot, it is still a useful test for listing good and bad snapshots, no?

Contributor Author

I guess the problem I had was that there was no way of creating a FAILED snapshot any longer and the whole premise of this test was to check that the status of a FAILED snapshot is returned properly from APIs.
Then again, as with the SLM test that I removed, let me see if I can create a BwC test for this by manipulating RepositoryData :)

Contributor Author

Alright, brought this back in a much simplified way to make sure we continue to be able to read the FAILED state. I think that's all we need here. We already test reading failed shard state in all kinds of places where we deal with PARTIAL snapshots, so I think just faking a zero-shard, FAILED snapshot here is good enough.
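The BwC concern sketched in miniature (hypothetical names, not the real `RepositoryData` wire format): new versions stop writing `FAILED`, but the parser must keep accepting it from repositories written by older versions.

```java
// Hypothetical sketch: readers must still accept the legacy "FAILED" value
// even though current writers never emit it.
class LegacySnapshotState {
    enum State { IN_PROGRESS, SUCCESS, PARTIAL, FAILED }

    static State parse(String name) {
        // Enum.valueOf keeps round-tripping "FAILED" written by old clusters.
        return State.valueOf(name);
    }
}
```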

}
}

public void testBasicFailureRetention() throws Exception {
Contributor

should we still test this scenario? Can folks disable partial snapshots on SLM?

Contributor Author

I don't think we have to; no more FAILED-state snapshots are created in the repo with this change (note you could only ever get a FAILED snapshot if partial was turned off in the first place). We could force the creation of a FAILED snapshot as a BwC test, maybe by manually messing with the RepositoryData (come to think of it ... I'll do that, otherwise we lose coverage for a BwC scenario).

Contributor Author

This needs some trickier tests, it turns out, but it's well worth it given how SLM manages these snapshot states at a low level. I opened #59082 to enable the necessary test infrastructure.

original-brownbear added a commit that referenced this pull request Jul 7, 2020
For #58994 it would be useful to be able to share test infrastructure.
This PR shares `AbstractSnapshotIntegTestCase` for that purpose, dries up SLM tests
accordingly and adds a shared and efficient (compared to the previous implementations)
way of waiting for no running snapshot operations to the test infrastructure to dry things up further.
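The shared "wait for no running snapshot operations" helper mentioned in that commit could look roughly like the following polling loop (a hedged sketch with a hypothetical `runningCount` supplier, not the actual test-infrastructure code, which watches cluster state instead):

```java
import java.util.function.IntSupplier;

// Hypothetical sketch of awaiting quiescence of snapshot operations.
class AwaitNoRunningSnapshots {
    static boolean await(IntSupplier runningCount, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (runningCount.getAsInt() == 0) {
                return true;
            }
            try {
                Thread.sleep(10); // back off between checks
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return runningCount.getAsInt() == 0;
    }
}
```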
original-brownbear added a commit that referenced this pull request Jul 7, 2020
…59119)

For #58994 it would be useful to be able to share test infrastructure.
This PR shares `AbstractSnapshotIntegTestCase` for that purpose, dries up SLM tests
accordingly and adds a shared and efficient (compared to the previous implementations)
way of waiting for no running snapshot operations to the test infrastructure to dry things up further.
failedSnapshotName.set(snapshotFuture.get().getSnapshotName());
assertNotNull(failedSnapshotName.get());
} else {
final String snapshotName = "failed-snapshot-1";
Contributor Author

Faking the FAILED snapshot with the right metadata so that SLM picks it up here now. It's a bit of a corner case since we won't be creating any new FAILED snapshots but probably nice to have this tested to make sure that in rolling upgrade scenarios FAILED snapshots are getting cleaned up eventually.
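The rolling-upgrade cleanup described above, in miniature (hypothetical types; the real logic lives in SLM's retention task): FAILED snapshots left behind by an older cluster should still be selected for deletion.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of retention picking up legacy FAILED snapshots.
class RetentionSketch {
    // Each entry: { name, state }. FAILED entries can only come from
    // repositories written before this change and are safe to delete.
    static List<String> failedSnapshotsToDelete(List<String[]> snapshots) {
        List<String> toDelete = new ArrayList<>();
        for (String[] snap : snapshots) {
            if ("FAILED".equals(snap[1])) {
                toDelete.add(snap[0]);
            }
        }
        return toDelete;
    }
}
```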

Contributor

@ywelsch ywelsch left a comment

LGTM

@original-brownbear
Contributor Author

Thanks Yannick!

@original-brownbear original-brownbear merged commit 02539fa into elastic:master Jul 8, 2020
@original-brownbear original-brownbear deleted the remove-snapshot-spurious-failed branch July 8, 2020 11:13
original-brownbear added a commit that referenced this pull request Jul 14, 2020
There is no point in writing out snapshots that contain no data that can be restored
whatsoever. It may have made sense to do so in the past, when there was an `INIT` snapshot
step that wrote data to the repository that would've otherwise become unreferenced, but in the
current-day state machine without the `INIT` step there is no point in doing so.
@original-brownbear original-brownbear restored the remove-snapshot-spurious-failed branch August 6, 2020 18:35

Labels

:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >non-issue Team:Distributed Meta label for distributed team. v7.9.0 v8.0.0-alpha1
