Enhance SnapshotResiliencyTests #49514
original-brownbear merged 5 commits into elastic:master from original-brownbear:more-resiliency-testing
Conversation
A few enhancements to `SnapshotResiliencyTests`:

1. Test running requests from random nodes in more spots to improve coverage (particularly motivated by #49060, where the additional number of cluster state updates makes it more interesting to fully cover all kinds of network failures).
2. Fix an issue with restarting only the master node in one test (doing so breaks the test at an incredibly low frequency, which becomes not so low in #49060 with the additional cluster state updates between request and response).
3. Improve cluster formation checks (now properly checking the term as well when forming the cluster) and make sure all nodes are connected to all other nodes (previously the data nodes would at times not be connected to other data nodes, which was shaken out by adding the `client()` method).
4. Make sure the cluster left behind by the test makes sense by running the repository cleanup action on it (this also increases coverage of the repository cleanup action and lays the groundwork for making it part of more resiliency tests).
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
Jenkins run elasticsearch-ci/packaging-sample-matrix |
**ywelsch** left a comment:

Left one comment, looking good otherwise.
```java
clearDisruptionsAndAwaitSync();

final StepListener<CleanupRepositoryResponse> cleanupResponse = new StepListener<>();
client().admin().cluster().cleanupRepository( // ...
```
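For readers unfamiliar with the pattern in the snippet above: `StepListener` lets a test register a callback for an asynchronous response and chain assertions on it once the step completes. Below is a minimal, illustrative stand-in (the names `MiniStepListener`, `whenComplete`, and `onResponse` are simplifications for this sketch; the real `StepListener` in Elasticsearch has a richer API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Minimal, illustrative stand-in for a step listener: callbacks are
// registered before or after the step completes, and the producer
// completes the step exactly once with a response.
final class MiniStepListener<T> {
    private final List<Consumer<T>> callbacks = new ArrayList<>();
    private boolean done;
    private T result;

    // Register a callback to run when the step completes successfully.
    void whenComplete(Consumer<T> callback) {
        if (done) {
            callback.accept(result);
        } else {
            callbacks.add(callback);
        }
    }

    // Complete the step, notifying all registered callbacks.
    void onResponse(T response) {
        if (done) {
            throw new IllegalStateException("step already completed");
        }
        done = true;
        result = response;
        callbacks.forEach(c -> c.accept(response));
    }
}

public class StepListenerSketch {
    public static void main(String[] args) {
        MiniStepListener<String> cleanupStep = new MiniStepListener<>();
        // A test would hand the listener to an async API (such as the
        // cleanupRepository call above) and chain its assertions here.
        cleanupStep.whenComplete(response -> System.out.println("cleanup result: " + response));
        cleanupStep.onResponse("0 blobs removed"); // simulate the async response arriving
    }
}
```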
I wonder if we should check before the clean-up whether the repo is not corrupted (i.e. does not have meta files pointing at non-existing data files).
> i.e. does not have meta files pointing at non-existing data files

This is not something the cleanup will ever resolve. The root `index-N` now points at the UUID-named generations of shard-level metadata. Neither of these changes in content as a result of the cleanup action (only a new `index-N` with the same content as the previous file is written).
-> I think it's fine to test things this way around so long as we don't do any repairs in the cleanup. We do the same in all other tests as well (all the `SharedClusterSnapshotRestoreIT` tests and such).
The point is to have stronger assertions: asserting something about the state of the repository before we do any clean-up (and also something that will hold at any point during snapshotting, snapshot deletion, etc., which we could check at any time). Having two separate sets of assertions also better shows what is guaranteed in a situation without clean-up and what the clean-up is essentially adding. You can do this in a follow-up; I feel, however, that mixing the two into one here weakens the coverage of these tests.
Makes sense, will add that in a follow up :)
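The two-phase assertion structure suggested above could be sketched as follows. This is a hypothetical, simplified model: the `Map`-of-blobs representation and the helper names are illustrative only, not Elasticsearch APIs; the real checks would inspect the repository's `index-N` and shard-level metadata blobs.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of two-phase repository verification:
// phase 1 asserts invariants that must hold at any time (no dangling
// metadata references), phase 2 runs cleanup and asserts the stronger
// post-cleanup invariant (no unreferenced data blobs remain).
public class RepoConsistencySketch {

    // Stand-in for "meta files do not point at non-existing data files".
    static boolean metadataReferencesResolve(Map<String, Set<String>> metaToData,
                                             Set<String> dataBlobs) {
        return metaToData.values().stream().allMatch(dataBlobs::containsAll);
    }

    // Stand-in for cleanup: drop data blobs not referenced by any metadata.
    static Set<String> cleanup(Map<String, Set<String>> metaToData,
                               Set<String> dataBlobs) {
        Set<String> referenced = new HashSet<>();
        metaToData.values().forEach(referenced::addAll);
        Set<String> kept = new HashSet<>(dataBlobs);
        kept.retainAll(referenced);
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> metaToData = Map.of("index-5", Set.of("blob-a"));
        Set<String> dataBlobs = new HashSet<>(Set.of("blob-a", "stale-blob"));

        // Phase 1: invariant that holds at any point, even before cleanup.
        if (!metadataReferencesResolve(metaToData, dataBlobs)) {
            throw new AssertionError("metadata points at missing data blobs");
        }

        // Phase 2: after cleanup, only referenced blobs remain.
        Set<String> afterCleanup = cleanup(metaToData, dataBlobs);
        if (!afterCleanup.equals(Set.of("blob-a"))) {
            throw new AssertionError("cleanup left unreferenced blobs behind");
        }
    }
}
```

Splitting the assertions this way makes it visible which guarantee already holds without cleanup and which one cleanup adds.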
Thanks Yannick!