Share IT Infrastructure between Core Snapshot and SLM ITs by original-brownbear · Pull Request #59082 · elastic/elasticsearch

original-brownbear · 2020-07-06T15:09:58Z

For #58994 it would be useful to be able to share test infrastructure.
This PR shares AbstractSnapshotIntegTestCase for that purpose, dries up the SLM tests
accordingly, and adds a shared way of waiting until no snapshot operations are running
to the test infrastructure (more efficient than the previous implementations) to dry things up further.

Note: the shared way of waiting for no more running operations was extracted from #56911, so this PR also decreases the size of that huge PR :)

elasticmachine · 2020-07-06T15:10:01Z

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

elasticmachine · 2020-07-06T15:10:03Z

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

original-brownbear · 2020-07-06T15:17:47Z

test/framework/src/main/java/org/elasticsearch/snapshots/AbstractSnapshotIntegTestCase.java

+        final ClusterService clusterService = internalCluster().getInstance(ClusterService.class, viaNode);
+        final ThreadPool threadPool = internalCluster().getInstance(ThreadPool.class, viaNode);
+        final ClusterStateObserver observer = new ClusterStateObserver(clusterService, logger, threadPool.getThreadContext());
+        if (statePredicate.test(observer.setAndGetObservedState()) == false) {


I like this a lot better than:

dataNodeClient().admin().cluster().prepareState().get().getState();

+ busy assert.

Tests often use this kind of waiting in situations where the busy assert fails for a bit and then wastes a second or two until its next run because of exponential back-off. Over the large number of tests that wait for a certain CS this way, it's well worth taking this approach IMO, especially with the concurrent snapshot ITs adding a large number of new tests that need this thing.
Also, the client() approach can (in disruption ITs) randomly pick the client of an isolated node and then waste effort and time retrying in the transport master node action.
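
To make the back-off cost concrete, here is a minimal sketch of the arithmetic. It assumes a polling loop whose sleep starts at 1 ms and doubles after every failed attempt, and it ignores the cost of the check itself; that schedule and the class `BackoffCost` are assumptions for illustration, not code from this PR.

```java
/** Hypothetical helper, only for illustrating the arithmetic of a doubling back-off. */
final class BackoffCost {

    /**
     * Time (ms since the wait began) of the first poll that can observe a condition
     * which becomes true at readyAtMillis, assuming polls at 0, 1, 3, 7, 15, ... ms
     * (the sleep starts at 1 ms and doubles after every failed attempt).
     */
    static long firstSuccessfulPollMillis(long readyAtMillis) {
        long now = 0;
        long sleep = 1;
        while (now < readyAtMillis) {
            now += sleep;   // sleep until the next poll
            sleep *= 2;     // double the sleep for the attempt after that
        }
        return now;
    }

    /** Time spent sleeping although the condition already held. */
    static long wastedMillis(long readyAtMillis) {
        return firstSuccessfulPollMillis(readyAtMillis) - readyAtMillis;
    }
}
```

Under these assumptions, a condition that becomes true 1100 ms into the wait is only noticed by the poll at 2047 ms, so `BackoffCost.wastedMillis(1100)` is 947. A listener-based wait has no such gap because it reacts when the state update is applied.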

tlrx

I understand the motivation; I'm wondering if it should always wait for the next change?

original-brownbear

I think waiting for the next change would make it very hard to avoid races. We often have this pattern:

1. do operation
2. wait for no more operations running

If the first step completes before we start waiting, we deadlock. And this is just one example; the concurrent snapshotting tests make use of this logic in other spots where similar races could occur.
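
The race can be sketched in plain Java. This is a toy model, not the Elasticsearch `ClusterStateObserver` API: `ToyState`, `ToyWaits` and their methods are hypothetical names, and the "state" is reduced to a single counter of running operations. It contrasts a wait that only reacts to future updates with one that checks the current value first.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.IntConsumer;
import java.util.function.IntPredicate;

/** Toy stand-in for a node's applied cluster state; not an Elasticsearch class. */
final class ToyState {
    private final List<IntConsumer> listeners = new CopyOnWriteArrayList<>();
    private volatile int runningOperations;

    int current() { return runningOperations; }
    void addListener(IntConsumer listener) { listeners.add(listener); }
    void removeListener(IntConsumer listener) { listeners.remove(listener); }

    /** Publishes a new value and notifies all registered listeners. */
    synchronized void update(int newValue) {
        runningOperations = newValue;
        for (IntConsumer listener : listeners) {
            listener.accept(newValue);
        }
    }
}

final class ToyWaits {
    /** Only reacts to future updates, so it misses a condition that already holds. */
    static boolean awaitNextChange(ToyState state, IntPredicate predicate, long timeoutMillis) {
        return await(state, predicate, timeoutMillis, false);
    }

    /** Checks the current value first, then falls back to waiting for updates. */
    static boolean awaitState(ToyState state, IntPredicate predicate, long timeoutMillis) {
        return await(state, predicate, timeoutMillis, true);
    }

    private static boolean await(ToyState state, IntPredicate predicate, long timeoutMillis, boolean checkCurrent) {
        CountDownLatch latch = new CountDownLatch(1);
        IntConsumer listener = value -> {
            if (predicate.test(value)) {
                latch.countDown();
            }
        };
        // Register before reading the current value so no update can slip in between.
        state.addListener(listener);
        try {
            if (checkCurrent && predicate.test(state.current())) {
                return true;
            }
            return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            state.removeListener(listener);
        }
    }
}
```

If the operation has already finished (`state.update(1)` followed by `state.update(0)`) before the wait starts, `ToyWaits.awaitState(state, ops -> ops == 0, 1000)` returns `true` right away, while `ToyWaits.awaitNextChange` with the same predicate blocks until its timeout because no further update arrives.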

tlrx

Ok. I was wondering if something similar could happen: 1. do operation and 2. check cluster state on a data node that has not yet processed the updated cluster state.

original-brownbear

👍 Yes, that was an issue in the concurrency tests. In the end it just requires ensuring, by some other means, that stuff has actually started before waiting for it to go away :)

tlrx

Good! Thanks

original-brownbear · 2020-07-06T15:19:07Z

x-pack/plugin/ilm/src/test/java/org/elasticsearch/xpack/slm/SLMSnapshotBlockingIntegTests.java

-        }
-    }
-
-    public static void unblockAllDataNodes(String repository) {


All of these methods were just copies of the ones in AbstractSnapshotIntegTestCase, which is why no other changes to the code were needed here.

original-brownbear · 2020-07-06T15:20:29Z

x-pack/plugin/ilm/src/test/java/org/elasticsearch/xpack/slm/SLMSnapshotBlockingIntegTests.java

                assertEquals(SnapshotState.SUCCESS, snapshotInfo.state());
            });
        }
+        awaitNoMoreRunningOperations(internalCluster().getMasterName());


Needed to add a new wait here because AbstractSnapshotIntegTestCase runs some repo consistency checks after the tests (well worth having these here anyway), which will break if there's still work being done in the cluster (which may be the case in this test).

tlrx

LGTM - I left a minor question

original-brownbear · 2020-07-07T08:39:10Z

Thanks Tanguy!

…59119) For #58994 it would be useful to be able to share test infrastructure. This PR shares `AbstractSnapshotIntegTestCase` for that purpose, dries up SLM tests accordingly and adds a shared and efficient (compared to the previous implementations) way of waiting for no running snapshot operations to the test infrastructure to dry things up further.

Fixed an issue that #59082 introduced. We have to wait for no more operations in all tests here, not just the one we were already waiting in, so that the cleanup operation from the parent class can run without failure.

original-brownbear added the >test, :Distributed/Snapshot/Restore, :Data Management/ILM+SLM, v8.0.0 and v7.9.0 labels Jul 6, 2020

elasticmachine added the Team:Distributed label Jul 6, 2020

elasticmachine added the Team:Data Management (obsolete) label Jul 6, 2020

original-brownbear mentioned this pull request Jul 6, 2020

Fix TODO about Spurious FAILED Snapshots #58994

Merged

original-brownbear requested review from tlrx and ywelsch July 6, 2020 16:06

tlrx approved these changes Jul 7, 2020

original-brownbear merged commit 4ed6c0e into elastic:master Jul 7, 2020

original-brownbear deleted the unify-snapshot-test-infra branch July 7, 2020 08:39

original-brownbear mentioned this pull request Jul 7, 2020

Share IT Infrastructure between Core Snapshot and SLM ITs (#59082) #59119

Merged

original-brownbear mentioned this pull request Jul 7, 2020

Fix SLM Tests Leaking Snapshot Operation #59150

Merged

original-brownbear mentioned this pull request Jul 7, 2020

Fix SLM Tests Leaking Snapshot Operation (#59150) #59155

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
