Skip to content

Enhance SnapshotResiliencyTests#49514

Merged
original-brownbear merged 5 commits intoelastic:masterfrom
original-brownbear:more-resiliency-testing
Nov 25, 2019
Merged

Enhance SnapshotResiliencyTests#49514
original-brownbear merged 5 commits intoelastic:masterfrom
original-brownbear:more-resiliency-testing

Conversation

@original-brownbear
Copy link
Copy Markdown
Contributor

@original-brownbear original-brownbear commented Nov 23, 2019

A few enhancements to SnapshotResiliencyTests:

  1. Test running requests from random nodes in more spots to enhance coverage (this is particularly motivated by Use ClusterState as Consistency Source for Snapshot Repositories #49060 where the additional number of cluster state updates makes it more interesting to fully cover all kinds of network failures)
  2. Fix issue with restarting only master node in one test (doing so breaks the test at an incredibly low frequency, that becomes not so low in Use ClusterState as Consistency Source for Snapshot Repositories #49060 with the additional cluster state updates between request and response)
  3. Improved cluster formation checks (now properly checks the term as well when forming cluster) + makes sure all nodes are connected to all other nodes (previously the data nodes would at times not be connected to other data nodes, which was shaken out now by adding the client() method
  4. Make sure the cluster left behind by the test makes sense by running the repo cleanup action on it (this also increases coverage of the repository cleanup action obviously and adds the basis of making it part of more resiliency tests)

A few enhancements to `SnapshotResiliencyTests`:
1. Test running requests from random nodes in more spots
2. Fix issue with restarting only master node in one test
3. Improved cluster formation checks
@original-brownbear original-brownbear added :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v7.6.0 labels Nov 23, 2019
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@original-brownbear original-brownbear added the >test Issues or PRs that are addressing/adding tests label Nov 24, 2019
@original-brownbear original-brownbear marked this pull request as ready for review November 24, 2019 08:48
@original-brownbear
Copy link
Copy Markdown
Contributor Author

Jenkins run elasticsearch-ci/packaging-sample-matrix

Copy link
Copy Markdown
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one comment, looking good o.w.

clearDisruptionsAndAwaitSync();

final StepListener<CleanupRepositoryResponse> cleanupResponse = new StepListener<>();
client().admin().cluster().cleanupRepository(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should check before the clean-up whether the repo is not corrupted (i.e. does not have meta files pointing at non-existing data files).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e. does not have meta files pointing at non-existing data files

This is not something the cleanup will resolve ever. The root index-N points at the uuid named generations of shard-level metadata now. Neither of these is changed in content (only a new index-N with the same content as the previous file is written) as a result of the cleanup action.

-> I think it's fine to test things this way around so long as we don't do any repairs in the cleanup. We do the same in all other tests as well (all the SharedClusterSnapshotRestoreIT and such).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is to have stronger assertions, asserting something about the state of the repository before we do any clean-up (and also something that will hold at any point during snapshotting / snapshot deletion etc., which we could check at any time). Having two separate sets of assertions also better shows what's being guaranteed in a situation without clean-up, and what the clean-up is essentially adding. You can do this in a follow-up, I feel however that mixing the two into one here weakens the coverage of these tests.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, will add that in a follow up :)

@original-brownbear
Copy link
Copy Markdown
Contributor Author

Thanks Yannick!

@original-brownbear original-brownbear merged commit fe69c60 into elastic:master Nov 25, 2019
@original-brownbear original-brownbear deleted the more-resiliency-testing branch November 25, 2019 10:42
original-brownbear added a commit that referenced this pull request Nov 25, 2019
A few enhancements to `SnapshotResiliencyTests`:
1. Test running requests from random nodes in more spots to enhance coverage (this is particularly motivated by #49060 where the additional number of cluster state updates makes it more interesting to fully cover all kinds of network failures)
2. Fix issue with restarting only master node in one test (doing so breaks the test at an incredibly low frequency, that becomes not so low in #49060 with the additional cluster state updates between request and response)
3. Improved cluster formation checks (now properly checks the term as well when forming cluster) + makes sure all nodes are connected to all other nodes (previously the data nodes would at times not be connected to other data nodes, which was shaken out now by adding the `client()` method
4. Make sure the cluster left behind by the test makes sense by running the repo cleanup action on it (this also increases coverage of the repository cleanup action obviously and adds the basis of making it part of more resiliency tests)
@original-brownbear original-brownbear restored the more-resiliency-testing branch August 6, 2020 19:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test Issues or PRs that are addressing/adding tests v7.6.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants