Skip to content

Refactor SnapshotResiliencyTests#140152

Merged
elasticsearchmachine merged 40 commits intoelastic:mainfrom
ywangd:snapshot-resiliency-tests-refactor
Jan 20, 2026
Merged

Refactor SnapshotResiliencyTests#140152
elasticsearchmachine merged 40 commits intoelastic:mainfrom
ywangd:snapshot-resiliency-tests-refactor

Conversation

@ywangd
Copy link
Copy Markdown
Member

@ywangd ywangd commented Jan 5, 2026

Refactor for SnapshotResiliencyTests to make it easier to extend. The main changes are (1) extract TestCluster related code into its own file and (2) add a separate init phase to the test node so that it can be customized by future subclasses. It also extracts serveral protected methods from existing code so they can be customized in future as well.

Relates: #138333

Also support waiting for a condition before starting the restart node.
@ywangd ywangd requested a review from DaveCTurner January 5, 2026 07:54
@ywangd ywangd added the >test Issues or PRs that are addressing/adding tests label Jan 5, 2026
@ywangd ywangd added :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v9.4.0 labels Jan 5, 2026
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. label Jan 5, 2026
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks broadly good to me, just a few questions/nits.

}
return true;
}).sorted(Comparator.comparing(n -> n.node.getName())).toList();
return dataNodes.isEmpty() ? Optional.empty() : Optional.ofNullable(ESTestCase.randomFrom(dataNodes));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: slight preference for qualified imports rather than qualifying the method calls

Suggested change
return dataNodes.isEmpty() ? Optional.empty() : Optional.ofNullable(ESTestCase.randomFrom(dataNodes));
return dataNodes.isEmpty() ? Optional.empty() : Optional.ofNullable(randomFrom(dataNodes));

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure pushed 94a9744

}
}

public static class TestClusterNode {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why raise this to a first-level inner static class rather than remaining an inner class of TestClusterNodes? I'm guessing it's so that we can subclass it later? But if you're in a subclass of TestClusterNodes can you not subclass TestClusterNode? Haven't tried it, just curious really.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sorta generally prefer top level classes. But it is also ok to keep it as an inner class of TestClusterNodes. We can still subclass it as you suggested which is what happened for DeterministicQueue#DeterministicThreadPool. So I changed it accordingly in 9d6806d
Since it touches the constructor, the change also moved both nodeSettings() and createPluginsService() methods to TestClusterNodes so that we can remove the this-escape.

Copy link
Copy Markdown
Member Author

@ywangd ywangd Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if you want TestClusterNodes to be put back into SnapshotResiliencyTests. I'd like to still make it a static class in that case. That said, I find it overall more manageable with two separate files since both of them have quite large sizes and we could potentially use the test cluster for other things in future.

Comment on lines +1112 to +1113
actions.put(
TransportClusterHealthAction.TYPE,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to register this action now? Seems like a little more than a pure refactor if so. If not, can we delay it until it's needed?

Copy link
Copy Markdown
Member Author

@ywangd ywangd Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this action registration along with the doInit method in 1dbc330

)
);

doInit(actions, actionFilters);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise here, this hook doesn't do anything yet and adding it will be relatively noise-free later.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above


private Coordinator coordinator;

@SuppressWarnings("this-escape")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this really allow a this to escape, given that we now have a separate init() method?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I see now, I thought it picked this up earlier in the precommit run than it did. The issue is the calls to the instance methods nodeSettings() and createPluginsService(), which do not leak this here but might do so in a subclass I guess. But I wonder, do we need these to be overridable by subclasses or could we pass them in as contructor parameters instead?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As commented above, it is removed in 9d6806d


runUntil(documentCountVerified::get, TimeUnit.MINUTES.toMillis(5L));
assertNotNull(safeResult(createSnapshotResponseListener));
assertNotNull(safeResult(restoreSnapshotResponseListener));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm do we need to drop this assertion?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a mistake. Restored in 8970408 Thanks for catching it.

Comment on lines +934 to +936
.filter(ShardRouting::primary)
.findFirst()
.orElseThrow(() -> new AssertionError("no primary shard found"));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slight preference for deferring this change in behaviour.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, restored in 266bd8b

null,
emptySet()
);
// TODO: The indexNameExpressionResolver does not use the same threadContext and projectResolver
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this important? It sounds important...

Copy link
Copy Markdown
Member Author

@ywangd ywangd Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Practically it is not important since the tests do not rely on either multi-project or system indices. I didn't touch it because there are other inconsistencies with projectResolver and I was not sure whether it was the right PR to fix them. Since we now have these changes separately. I pushed 4b5d66a to fix them.

@ywangd ywangd requested a review from DaveCTurner January 6, 2026 02:16
@ywangd
Copy link
Copy Markdown
Member Author

ywangd commented Jan 8, 2026

@DaveCTurner This PR is ready for another look when you have time. Thanks!

@ywangd
Copy link
Copy Markdown
Member Author

ywangd commented Jan 15, 2026

@DaveCTurner Sorry to ping you again. Your review is appreciated when convenient. Thanks a lot!

Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now, thanks Yang


continueOrDie(clusterStateResponseStepListener, clusterStateResponse -> {
final ShardRouting shardToRelocate = clusterStateResponse.getState().routingTable().allShards(index).get(0);
;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: looks like a typo here

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Removed in f15e670

@ywangd ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jan 19, 2026
@elasticsearchmachine elasticsearchmachine merged commit ba61a61 into elastic:main Jan 20, 2026
35 checks passed
@ywangd ywangd deleted the snapshot-resiliency-tests-refactor branch January 20, 2026 01:11
spinscale pushed a commit to spinscale/elasticsearch that referenced this pull request Jan 21, 2026
Refactor for `SnapshotResiliencyTests` to make it easier to extend. The
main changes are (1) extract TestCluster related code into its own file
and (2) add a separate init phase to the test node so that it can be
customized by future subclasses. It also extracts serveral protected
methods from existing code so they can be customized in future as well.

Relates: elastic#138333
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. >test Issues or PRs that are addressing/adding tests v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants