Skip to content

Make SnapshotResiliencyTests extensible for stateless#138333

Merged
elasticsearchmachine merged 34 commits intoelastic:mainfrom
ywangd:extend-snapshot-resiliency-tests-to-stateless
Jan 22, 2026
Merged

Make SnapshotResiliencyTests extensible for stateless#138333
elasticsearchmachine merged 34 commits intoelastic:mainfrom
ywangd:extend-snapshot-resiliency-tests-to-stateless

Conversation

@ywangd
Copy link
Copy Markdown
Member

@ywangd ywangd commented Nov 20, 2025

This a pure refactor for SnapshotResiliencyTests to make it possible to be extended on the stateless side. The main changes are (1) allowing subclasses for the test class and friends; (2) extracting TestCluster related code into its own file; (3) allowing predefined executors to run as part of DeterministicQueue.

@ywangd ywangd requested a review from DaveCTurner November 20, 2025 04:56
@ywangd ywangd added >test Issues or PRs that are addressing/adding tests :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v9.3.0 labels Nov 20, 2025
@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Nov 20, 2025
Comment on lines +320 to +322
final CachedSupplier<ExecutorService> supplier = predefinedExecutors.get(name);
if (supplier != null) {
return supplier.get();
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be controvesial. It allows a set of executors to run tasks without going through DeterministicTaskQueue. They are used to allow blocking reads to work.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bleurgh yes controversial indeed, the whole thing stops being deterministic if we do this which defeats the point.

I haven't had a chance to dig into this and need some time to think about this more (won't be for a few hours at least) but my initial thoughts are that I'd expect the single-threaded tests to work adequately even with blocking repo operations. The thing which won't work is blocking one thread on a future while expecting other threads to do some work to complete it, and that's a pattern we should be able to (and want to) avoid here anyway.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing which won't work is blocking one thread on a future while expecting other threads to do some work to complete it

Yes that's exactly the problem. We have quite many such use cases in the stateless path, see example here and here. There are more.

need some time to think about this more (won't be for a few hours at least)

No rush. Even coming to just this point took me a lot of time. Thanks for looking into it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the whole thing stops being deterministic if we do this which defeats the point

I was thinking that it would still retain the value if more interestsing work, e.g. the snapshot read/write, remains being executed deterministically. But yeah it still seems an ugly patchwork.

Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's exactly the problem.

Right ok, this is the exact antipattern I think we should move away from: the underlying code is async and then there's an outer layer that creates a future and then blocks on its result in order to present a sync API. Typically it's quite a mechanical change to pull these things out to the caller, and then out again, etc. until either the API becomes async again and the blocking evaporates or at least until the blocking is no longer needed for this snapshot work.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no longer needed for this snapshot change

These blocking operations are not related to snapshot. They are used more fundamentally in Stateless including recovery, refresh, flush, cluster state persistence. IIUC, they were added to satisfy Lucene's synchronous IO pattern.

I agree we want to eventually moving away from them. It seems a significant undertaking. I am OK if we decide to defer extending SnapshotResiliencyTests to stateless until we fix the blocking operations. But I think we don't want to tie it to the stateless snapshot developement. In fact, by reading shard data directly from the object store, it removes a path for blocking reading the data from the local shard.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed offline and David has a great suggestion of using the direct executor for such blocking calls so that the future is ready when the blocking happens. I have updated the implementation based on this idea.

return mock(PluginsService.class);
}

public final void init() throws IOException {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The biggest change for TestClusterNode is that initialization is split between instantation and this new init() method. Previously it was all done at object instanation time which was not feasible for extending on the serverless side.

throw new UnsupportedOperationException();
}
protected DeterministicThreadPool(Function<Runnable, Runnable> runnableWrapper) {
this.runnableWrapper = runnable -> getThreadContext().preserveContext(runnableWrapper.apply(runnable));
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the fix for always preserving threadContext for each task. Also added a test for it in 2ad3965

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, but again I'd slightly prefer this to be a preliminary PR just to make it easier to review with confidence that it's all properly tested.

public <T> T invokeAny(Collection<? extends Callable<T>> tasks) {
throw new UnsupportedOperationException();
}
protected class DeterministicThreadPool extends ThreadPool {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main change to DeterministicTaskQueue is pull the annoymous ThreadPool class to a named class so that it can be subclassed. I suggest viewing the diff with "Hide whitespace" to minimize the visual clutter.

Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good to me but I would like to break this change up into several steps to avoid doing too much all at once and missing something in the reviews. In particular, as well as pulling out various bits of functionality I believe this also changes the behaviour of SnapshotResiliencyTests in some (hopefully unimportant) ways. For instance:

  • maybeWaitForGreenAfterRestore
  • the injection of index settings
  • another stateless-only branch in testSnapshotPrimaryRelocations
  • delayed startup in restart()

None of this looks wrong, I'm just uncomfortable with doing a big mechanical change plus some slightly fiddly manual changes all at once. I called out in inline comments some other bits I'd prefer to be done as preliminary steps too.

SnapshotsInfoService snapshotsInfoService
) {
super(
this(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather add this constructor parameter to all the callers (there's not very many) rather than delegating to another constructor here. Could be done as a preliminary step to reduce noise.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I split these change to #140153

throw new UnsupportedOperationException();
}
protected DeterministicThreadPool(Function<Runnable, Runnable> runnableWrapper) {
this.runnableWrapper = runnable -> getThreadContext().preserveContext(runnableWrapper.apply(runnable));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, but again I'd slightly prefer this to be a preliminary PR just to make it easier to review with confidence that it's all properly tested.

Comment on lines +461 to +462
public ScheduledFuture<?> schedule(Runnable command, long delay, TimeUnit unit) {
// For GlobalCheckpointListeners#add
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This deserves more testing I think (and thus likely a separate PR)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I split changes for DeterministicQueue to its own PR #140151

The PR includes all 3 changes to the class. Please let me know if you'd like to further split it. Thanks!

Comment on lines +346 to +348
protected final Function<Runnable, Runnable> runnableWrapper;

@Override
public ScheduledFuture<?> scheduleAtFixedRate(Runnable command, long initialDelay, long period, TimeUnit unit) {
throw new UnsupportedOperationException();
}
protected DeterministicThreadPool(Function<Runnable, Runnable> runnableWrapper) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: would prefer constructors and fields at the top of the class

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are at the top of the class. It might be a bit hidden because the instance level initialization code for forkingExecutor. I can move the initialization into the constructor if it helps. I didn't do it to minimize the change set. Please let me know.

ywangd added a commit to ywangd/elasticsearch that referenced this pull request Jan 5, 2026
This PR makes MockAllocationService constructor take an explicit
ShardRoutingRoleStrategy parameter so that it can be customized.

Relates: elastic#138333
@ywangd
Copy link
Copy Markdown
Member Author

ywangd commented Jan 5, 2026

  • maybeWaitForGreenAfterRestore
  • the injection of index settings
  • another stateless-only branch in testSnapshotPrimaryRelocations
  • delayed startup in restart()

I raised a separate PR #140152 for just SnapshotResiliencyTests and it removes the above changes. The new PR should be more of a pure refactor. Please let me know if you spot anything for future cleanups. Thanks!

EDIT: For easy consumption, the 3 PRs splitted from this one are the follows:

  1. DeterministicQueue refactor and enhancement #140151
  2. Refactor SnapshotResiliencyTests #140152
  3. Refactor MockAllocationService constructor #140153

elasticsearchmachine pushed a commit that referenced this pull request Jan 6, 2026
This PR makes MockAllocationService constructor take an explicit
ShardRoutingRoleStrategy parameter so that it can be customized.

Relates: #138333
elasticsearchmachine pushed a commit that referenced this pull request Jan 7, 2026
This PR makes following changes to DeterministicTaskQueue 1. Promote the
anonymous ThreadPool class to a named inner class,
`DeterministicThreadPool`. 2. Fix threadContext preservation for all
tasks 3. Add a basic implementation for
`ScheduledExecutorService#schedule(Runnable, long, TimeUnit)`. 

The PR is split from #138333
sidosera pushed a commit to sidosera/elasticsearch that referenced this pull request Jan 7, 2026
This PR makes following changes to DeterministicTaskQueue 1. Promote the
anonymous ThreadPool class to a named inner class,
`DeterministicThreadPool`. 2. Fix threadContext preservation for all
tasks 3. Add a basic implementation for
`ScheduledExecutorService#schedule(Runnable, long, TimeUnit)`. 

The PR is split from elastic#138333
elasticsearchmachine pushed a commit that referenced this pull request Jan 20, 2026
Refactor for `SnapshotResiliencyTests` to make it easier to extend. The
main changes are (1) extract TestCluster related code into its own file
and (2) add a separate init phase to the test node so that it can be
customized by future subclasses. It also extracts serveral protected
methods from existing code so they can be customized in future as well.

Relates: #138333
Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

return true;
}).sorted(Comparator.comparing(n -> n.node.getName())).toList();
return dataNodes.isEmpty() ? Optional.empty() : Optional.ofNullable(randomFrom(dataNodes));
return dataNodes.isEmpty() ? Optional.empty() : Optional.ofNullable(ESTestCase.randomFrom(dataNodes));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could we use qualified imports for these things?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah some merge conflicts didn't resolve as they should have. This was one of them. I have now cleaned up in abb4f3b Thanks!

spinscale pushed a commit to spinscale/elasticsearch that referenced this pull request Jan 21, 2026
Refactor for `SnapshotResiliencyTests` to make it easier to extend. The
main changes are (1) extract TestCluster related code into its own file
and (2) add a separate init phase to the test node so that it can be
customized by future subclasses. It also extracts serveral protected
methods from existing code so they can be customized in future as well.

Relates: elastic#138333
@ywangd ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jan 22, 2026
@elasticsearchmachine elasticsearchmachine merged commit ff7d81a into elastic:main Jan 22, 2026
35 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs serverless-linked Added by automation, don't add manually Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. >test Issues or PRs that are addressing/adding tests v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants