Fix TODO about Spurious FAILED Snapshots #58994
original-brownbear merged 14 commits into elastic:master from original-brownbear:remove-snapshot-spurious-failed
There is no point in writing out snapshots that contain no restorable data whatsoever. It may have made sense in the past, when there was an `INIT` snapshot step that wrote data to the repository that would otherwise have become unreferenced, but in the current state machine, which has no `INIT` step, there is no point in doing so.
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
@tlrx I know we discussed this before and there was some worry about Cloud here. I think that's a non-issue since Cloud has been using partial snapshots ever since 7.6, so this change doesn't affect them :)
ywelsch left a comment
I've left a few comments, generally looking good though.
State.FAILED, indexIds, dataStreams, threadPool.absoluteTimeInMillis(), repositoryData.getGenId(), shards,
"Indices don't have primary shards " + missing, userMeta, version);
throw new SnapshotException(
    new Snapshot(repositoryName, snapshotId), "Indices don't have primary shards " + missing);
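The behavior this hunk moves to can be sketched in isolation. The following is a minimal, self-contained illustration and not the Elasticsearch code: `SnapshotStartSketch`, `SnapshotStartException`, and `ensurePrimariesAvailable` are hypothetical names, and the map of index name to "primary available" is a simplification of the real shard bookkeeping. It only shows the idea that a non-partial snapshot with missing primaries is rejected up front instead of being recorded as a FAILED snapshot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SnapshotStartSketch {

    /** Hypothetical stand-in for Elasticsearch's SnapshotException. */
    static class SnapshotStartException extends RuntimeException {
        SnapshotStartException(String snapshot, String message) {
            super("[" + snapshot + "] " + message);
        }
    }

    /**
     * Rejects a non-partial snapshot when any index lacks a primary shard.
     * Nothing is written anywhere; the caller simply gets an exception.
     * With partial == true the check is skipped (assumption: missing shards
     * are then handled later as part of a PARTIAL snapshot).
     */
    static void ensurePrimariesAvailable(String snapshot, Map<String, Boolean> primaryAvailable, boolean partial) {
        if (partial) {
            return;
        }
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : primaryAvailable.entrySet()) {
            if (!entry.getValue()) {
                missing.add(entry.getKey());
            }
        }
        if (!missing.isEmpty()) {
            throw new SnapshotStartException(snapshot, "Indices don't have primary shards " + missing);
        }
    }
}
```

Throwing before anything is persisted means there is no zero-shard FAILED entry left behind for later cleanup, which is the point made in the PR description.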
    assertEquals(SnapshotState.SUCCESS, getSnapshotsResponse.getSnapshots("test-repo-2").get(0).state());
}

public void testSnapshotStatusOnFailedIndex() throws Exception {
While this test used the old behavior to get a failed snapshot, it is still a useful test for listing good and bad snapshots, no?
I guess the problem I had was that there was no way of creating a FAILED snapshot any longer and the whole premise of this test was to check that the status of a FAILED snapshot is returned properly from APIs.
Then again, as with the SLM test that I removed, let me see if I can create a BwC test for this by manipulating RepositoryData :)
Alright, brought this back in a much simplified way to make sure we continue to be able to read the FAILED state. I think that's all we need here. Reading failed shard state we test in all kinds of places where we deal with PARTIAL snapshots so I think just faking a 0 shards, failed snapshot here is good enough.
server/src/test/java/org/elasticsearch/snapshots/SnapshotResiliencyTests.java
    }
}

public void testBasicFailureRetention() throws Exception {
should we still test this scenario? Can folks disable partial snapshots on SLM?
I don't think we have to; with this change no more FAILED-state snapshots are created in the repo (note that you could only ever get a FAILED snapshot if partial was turned off in the first place). We could maybe force the creation of a FAILED snapshot as a BwC test by manually messing with the RepositoryData (come to think of it ... I'll do that, otherwise we lose coverage for a BwC scenario).
This needs some trickier tests it turns out but is well worth it given how SLM low-level manages these snapshot states. I opened #59082 to enable the necessary test infrastructure.
For #58994 it would be useful to be able to share test infrastructure. This PR shares `AbstractSnapshotIntegTestCase` for that purpose, dries up SLM tests accordingly and adds a shared and efficient (compared to the previous implementations) way of waiting for no running snapshot operations to the test infrastructure to dry things up further.
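As a rough illustration of what "waiting for no running snapshot operations" means in a test, here is a minimal sketch. It is not the helper added to `AbstractSnapshotIntegTestCase`, whose implementation is not shown in this conversation; it swaps in plain polling with backoff, and `AwaitNoOperationsSketch`, `awaitNoRunningOperations`, and the `operationsRunning` supplier are hypothetical names.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public class AwaitNoOperationsSketch {

    /**
     * Polls until the supplier reports that nothing is running or the timeout expires.
     * Returns true if the system went idle in time, false on timeout or interruption.
     */
    static boolean awaitNoRunningOperations(BooleanSupplier operationsRunning, long timeout, TimeUnit unit) {
        final long deadline = System.nanoTime() + unit.toNanos(timeout);
        long sleepMillis = 1;
        while (operationsRunning.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out while operations were still running
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
            sleepMillis = Math.min(sleepMillis * 2, 100); // exponential backoff, capped at 100 ms
        }
        return true;
    }
}
```

In a real test the supplier would inspect whatever tracks in-flight snapshots and deletes; sharing one such helper is what lets the SLM tests and the snapshot tests avoid each carrying their own wait loop.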
    failedSnapshotName.set(snapshotFuture.get().getSnapshotName());
    assertNotNull(failedSnapshotName.get());
} else {
    final String snapshotName = "failed-snapshot-1";
Faking the FAILED snapshot with the right metadata so that SLM picks it up here now. It's a bit of a corner case since we won't be creating any new FAILED snapshots but probably nice to have this tested to make sure that in rolling upgrade scenarios FAILED snapshots are getting cleaned up eventually.
Thanks Yannick!