Share IT Infrastructure between Core Snapshot and SLM ITs#59082
Share IT Infrastructure between Core Snapshot and SLM ITs#59082original-brownbear merged 1 commit intoelastic:masterfrom original-brownbear:unify-snapshot-test-infra
Conversation
For #58994 it would be useful to be able to share test infrastructure. This PR shares `AbstractSnapshotIntegTestCase` for that purpose, dries up SLM tests accordingly and adds a shared and efficient (compared to the previous implementations) way of waiting for no running snapshot operations to the test infrastructure to dry things up further. Note: the shared way of waiting for no more running operations was extracted from #56911
|
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore) |
|
Pinging @elastic/es-core-features (:Core/Features/ILM+SLM) |
| final ClusterService clusterService = internalCluster().getInstance(ClusterService.class, viaNode); | ||
| final ThreadPool threadPool = internalCluster().getInstance(ThreadPool.class, viaNode); | ||
| final ClusterStateObserver observer = new ClusterStateObserver(clusterService, logger, threadPool.getThreadContext()); | ||
| if (statePredicate.test(observer.setAndGetObservedState()) == false) { |
There was a problem hiding this comment.
I like this a lot better than:
dataNodeClient().admin().cluster().prepareState().get().getState();+ busy assert.
Often times the tests use this kind of waiting when the busy assert will fail for a bit and then waste a second or two until the next run because of exponential back-off (over the large number of tests that do this kind of waiting for a certain CS it's well worth taking this approach IMO, especially with the concurrent snapshot ITs adding a large number of new tests that need this thing).
Also, the client() approach can (in disruption ITs) randomly go for the client of an isolated node and then waste effort and time for retrying in the transport master node action.
There was a problem hiding this comment.
I understand the motivation; I'm wondering if it should always wait for next change?
There was a problem hiding this comment.
I think waiting for the next change would make it very hard to avoid races. We often have this pattern:
- do operation
- wait for no more operations running
If the first step completes before we start waiting we dead-lock. And this is just one example, the concurrent snapshotting tests make use of this logic in other spots where similar races could occur
There was a problem hiding this comment.
Ok. I was wondering if something similar could happen: 1. do operation and 2. check cluster state on a data node that has not yet processed the updated cluster state
There was a problem hiding this comment.
👍 yea that was an issue in the concurrency tests, in the end it just requires ensuring that stuff actually started by some other means before waiting for it to go away :)
| } | ||
| } | ||
|
|
||
| public static void unblockAllDataNodes(String repository) { |
There was a problem hiding this comment.
All of these methods were just copies from AbstractSnapshotIntegTestCase that's why no other changes to the code were needed here.
| assertEquals(SnapshotState.SUCCESS, snapshotInfo.state()); | ||
| }); | ||
| } | ||
| awaitNoMoreRunningOperations(internalCluster().getMasterName()); |
There was a problem hiding this comment.
Needed to add a new wait here because the AbstractSnapshotIntegTestCase does some repo consistency checks in after the tests (well worth it to have these here anyway) which will break if there's still work done in the cluster (which may be the case in this test).
|
Thanks Tanguy! |
…59119) For #58994 it would be useful to be able to share test infrastructure. This PR shares `AbstractSnapshotIntegTestCase` for that purpose, dries up SLM tests accordingly and adds a shared and efficient (compared to the previous implementations) way of waiting for no running snapshot operations to the test infrastructure to dry things up further.
Fixed an issue #59082 introduced. We have to wait for no more operations in all tests here not just the one we were waiting in already so that the cleanup operation from the parent class can run without failure.
For #58994 it would be useful to be able to share test infrastructure.
This PR shares
AbstractSnapshotIntegTestCasefor that purpose, dries up SLM testsaccordingly and adds a shared and efficient (compared to the previous implementations)
way of waiting for no running snapshot operations to the test infrastructure to dry things up further.
Note: the shared way of waiting for no more running operations was extracted from #56911 so this PR also decreases the size of that huge PR :)