Share IT Infrastructure between Core Snapshot and SLM ITs #59082

Merged
original-brownbear merged 1 commit into elastic:master from original-brownbear:unify-snapshot-test-infra on Jul 7, 2020

Conversation

@original-brownbear (Contributor) commented Jul 6, 2020

For #58994 it would be useful to be able to share test infrastructure.
This PR shares `AbstractSnapshotIntegTestCase` for that purpose, dries up the SLM tests
accordingly, and adds a shared and efficient (compared to the previous implementations)
way of waiting for no running snapshot operations to the test infrastructure, drying things up further.

Note: the shared way of waiting for no more running operations was extracted from #56911, so this PR also decreases the size of that huge PR :)

@original-brownbear added the >test, :Distributed/Snapshot/Restore, :Data Management/ILM+SLM, v8.0.0, and v7.9.0 labels on Jul 6, 2020
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@elasticmachine added the Team:Distributed label on Jul 6, 2020
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

@elasticmachine added the Team:Data Management (obsolete) label on Jul 6, 2020
```java
final ClusterService clusterService = internalCluster().getInstance(ClusterService.class, viaNode);
final ThreadPool threadPool = internalCluster().getInstance(ThreadPool.class, viaNode);
final ClusterStateObserver observer = new ClusterStateObserver(clusterService, logger, threadPool.getThreadContext());
if (statePredicate.test(observer.setAndGetObservedState()) == false) {
```
@original-brownbear (Contributor, Author) commented Jul 6, 2020


I like this a lot better than:

```java
dataNodeClient().admin().cluster().prepareState().get().getState();
```

plus a busy assert.

Often the tests use this kind of waiting where the busy assert will fail for a bit and then waste a second or two until the next run because of its exponential back-off. Over the large number of tests that do this kind of waiting for a certain cluster state, this approach is well worth it IMO, especially with the concurrent snapshot ITs adding a large number of new tests that need it.
Also, the `client()` approach can (in disruption ITs) randomly pick the client of an isolated node and then waste effort and time retrying in the transport master node action.
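The snippet under review registers with a `ClusterStateObserver` and checks the predicate against the current state before waiting. A minimal, self-contained sketch of that check-then-listen pattern in plain Java (the `StateObserver` class here is a hypothetical stand-in for Elasticsearch's `ClusterStateObserver`, not the actual implementation):

```java
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

// Hypothetical stand-in for a state holder that notifies listeners on change.
class StateObserver<S> {
    private volatile S state;
    private final CopyOnWriteArrayList<Runnable> listeners = new CopyOnWriteArrayList<>();

    StateObserver(S initial) {
        this.state = initial;
    }

    void publish(S newState) {
        state = newState;
        for (Runnable l : listeners) {
            l.run();
        }
    }

    // Register the listener first, then check the current state, so an update
    // arriving between the check and the registration cannot be missed.
    // No polling loop, no exponential back-off.
    boolean await(Predicate<S> predicate, long timeout, TimeUnit unit) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        Runnable listener = () -> {
            if (predicate.test(state)) {
                done.countDown();
            }
        };
        listeners.add(listener);
        try {
            if (predicate.test(state)) {
                return true; // already satisfied: return immediately, analogous to setAndGetObservedState()
            }
            return done.await(timeout, unit);
        } finally {
            listeners.remove(listener);
        }
    }
}
```

Compared to a busy assert, a satisfied predicate returns immediately and an unsatisfied one wakes up exactly when the state changes, instead of after the next back-off interval.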

@tlrx (Member) commented:


I understand the motivation; I'm wondering if it should always wait for the next change?

@original-brownbear (Contributor, Author) commented:


I think waiting for the next change would make it very hard to avoid races. We often have this pattern:

  1. do operation
  2. wait for no more operations running

If the first step completes before we start waiting, we deadlock. And this is just one example; the concurrent snapshotting tests use this logic in other spots where similar races could occur.
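To make the race concrete: a waiter that subscribes only to future change notifications is blind to an operation that completed before the subscription was registered, so it can only sit there until its timeout. A tiny illustration in plain Java (`ChangeSource` and `awaitNextChange` are hypothetical names for this sketch, not Elasticsearch code):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical event source: listeners only hear about changes that happen
// AFTER they were registered -- past changes are invisible to them.
class ChangeSource {
    private final List<Runnable> listeners = new CopyOnWriteArrayList<>();

    void addListener(Runnable listener) {
        listeners.add(listener);
    }

    // E.g. "operation finished" is published to the listeners present right now.
    void change() {
        for (Runnable l : listeners) {
            l.run();
        }
    }
}

class NextChangeDemo {
    // Waits ONLY for the next change; returns false if none arrives in time.
    static boolean awaitNextChange(ChangeSource source, long millis) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        source.addListener(latch::countDown);
        return latch.await(millis, TimeUnit.MILLISECONDS);
    }
}
```

If `change()` (step 1 finishing) fires before `awaitNextChange` registers its listener, there is no later change left to observe and the wait times out. Checking the current state first, as the PR's helper does, closes exactly that window.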

@tlrx (Member) commented:


Ok. I was wondering if something similar could happen: 1. do an operation and 2. check the cluster state on a data node that has not yet processed the updated cluster state.

@original-brownbear (Contributor, Author) commented:


👍 yes, that was an issue in the concurrency tests; in the end it just requires ensuring, by some other means, that the operation actually started before waiting for it to go away :)

@tlrx (Member) commented:


Good! Thanks

```java
    }
}

public static void unblockAllDataNodes(String repository) {
```
@original-brownbear (Contributor, Author) commented:


All of these methods are just copies from `AbstractSnapshotIntegTestCase`; that's why no other changes to the code were needed here.

```java
    assertEquals(SnapshotState.SUCCESS, snapshotInfo.state());
});
}
awaitNoMoreRunningOperations(internalCluster().getMasterName());
```
@original-brownbear (Contributor, Author) commented:


Needed to add a new wait here because `AbstractSnapshotIntegTestCase` does some repository consistency checks after the tests (well worth having these here anyway), which will break if there is still work being done in the cluster (which may be the case in this test).

@tlrx (Member) left a comment


LGTM - I left a minor question

@original-brownbear (Contributor, Author) commented:

Thanks Tanguy!

@original-brownbear merged commit 4ed6c0e into elastic:master on Jul 7, 2020
@original-brownbear deleted the unify-snapshot-test-infra branch on July 7, 2020 at 08:39
original-brownbear added a commit that referenced this pull request Jul 7, 2020
…59119)

For #58994 it would be useful to be able to share test infrastructure.
This PR shares `AbstractSnapshotIntegTestCase` for that purpose, dries up SLM tests
accordingly and adds a shared and efficient (compared to the previous implementations)
way of waiting for no running snapshot operations to the test infrastructure to dry things up further.
original-brownbear added a commit that referenced this pull request Jul 7, 2020
Fixed an issue #59082 introduced. We have to wait for no more operations
in all tests here, not just the one we were already waiting in, so that the cleanup
operation from the parent class can run without failure.