Make SnapshotResiliencyTests extensible for stateless#138333
Make SnapshotResiliencyTests extensible for stateless#138333elasticsearchmachine merged 34 commits intoelastic:mainfrom
Conversation
Also expand DeterministicTaskQueue
...framework/src/main/java/org/elasticsearch/common/util/concurrent/DeterministicTaskQueue.java
Outdated
Show resolved
Hide resolved
| final CachedSupplier<ExecutorService> supplier = predefinedExecutors.get(name); | ||
| if (supplier != null) { | ||
| return supplier.get(); |
There was a problem hiding this comment.
This might be controvesial. It allows a set of executors to run tasks without going through DeterministicTaskQueue. They are used to allow blocking reads to work.
There was a problem hiding this comment.
Bleurgh yes controversial indeed, the whole thing stops being deterministic if we do this which defeats the point.
I haven't had a chance to dig into this and need some time to think about this more (won't be for a few hours at least) but my initial thoughts are that I'd expect the single-threaded tests to work adequately even with blocking repo operations. The thing which won't work is blocking one thread on a future while expecting other threads to do some work to complete it, and that's a pattern we should be able to (and want to) avoid here anyway.
There was a problem hiding this comment.
The thing which won't work is blocking one thread on a future while expecting other threads to do some work to complete it
Yes that's exactly the problem. We have quite many such use cases in the stateless path, see example here and here. There are more.
need some time to think about this more (won't be for a few hours at least)
No rush. Even coming to just this point took me a lot of time. Thanks for looking into it.
There was a problem hiding this comment.
the whole thing stops being deterministic if we do this which defeats the point
I was thinking that it would still retain the value if more interestsing work, e.g. the snapshot read/write, remains being executed deterministically. But yeah it still seems an ugly patchwork.
There was a problem hiding this comment.
Yes that's exactly the problem.
Right ok, this is the exact antipattern I think we should move away from: the underlying code is async and then there's an outer layer that creates a future and then blocks on its result in order to present a sync API. Typically it's quite a mechanical change to pull these things out to the caller, and then out again, etc. until either the API becomes async again and the blocking evaporates or at least until the blocking is no longer needed for this snapshot work.
There was a problem hiding this comment.
no longer needed for this snapshot change
These blocking operations are not related to snapshot. They are used more fundamentally in Stateless including recovery, refresh, flush, cluster state persistence. IIUC, they were added to satisfy Lucene's synchronous IO pattern.
I agree we want to eventually moving away from them. It seems a significant undertaking. I am OK if we decide to defer extending SnapshotResiliencyTests to stateless until we fix the blocking operations. But I think we don't want to tie it to the stateless snapshot developement. In fact, by reading shard data directly from the object store, it removes a path for blocking reading the data from the local shard.
There was a problem hiding this comment.
We discussed offline and David has a great suggestion of using the direct executor for such blocking calls so that the future is ready when the blocking happens. I have updated the implementation based on this idea.
…iency-tests-to-stateless
…iency-tests-to-stateless
Also support waiting for a condition before starting the restart node.
…iency-tests-to-stateless
server/src/test/java/org/elasticsearch/snapshots/SnapshotResiliencyTestHelper.java
Show resolved
Hide resolved
| return mock(PluginsService.class); | ||
| } | ||
|
|
||
| public final void init() throws IOException { |
There was a problem hiding this comment.
The biggest change for TestClusterNode is that initialization is split between instantation and this new init() method. Previously it was all done at object instanation time which was not feasible for extending on the serverless side.
...framework/src/main/java/org/elasticsearch/common/util/concurrent/DeterministicTaskQueue.java
Outdated
Show resolved
Hide resolved
...framework/src/main/java/org/elasticsearch/common/util/concurrent/DeterministicTaskQueue.java
Outdated
Show resolved
Hide resolved
| throw new UnsupportedOperationException(); | ||
| } | ||
| protected DeterministicThreadPool(Function<Runnable, Runnable> runnableWrapper) { | ||
| this.runnableWrapper = runnable -> getThreadContext().preserveContext(runnableWrapper.apply(runnable)); |
There was a problem hiding this comment.
This is the fix for always preserving threadContext for each task. Also added a test for it in 2ad3965
There was a problem hiding this comment.
Makes sense, but again I'd slightly prefer this to be a preliminary PR just to make it easier to review with confidence that it's all properly tested.
| public <T> T invokeAny(Collection<? extends Callable<T>> tasks) { | ||
| throw new UnsupportedOperationException(); | ||
| } | ||
| protected class DeterministicThreadPool extends ThreadPool { |
There was a problem hiding this comment.
The main change to DeterministicTaskQueue is pull the annoymous ThreadPool class to a named class so that it can be subclassed. I suggest viewing the diff with "Hide whitespace" to minimize the visual clutter.
DaveCTurner
left a comment
There was a problem hiding this comment.
Overall this looks good to me but I would like to break this change up into several steps to avoid doing too much all at once and missing something in the reviews. In particular, as well as pulling out various bits of functionality I believe this also changes the behaviour of SnapshotResiliencyTests in some (hopefully unimportant) ways. For instance:
maybeWaitForGreenAfterRestore- the injection of index settings
- another stateless-only branch in
testSnapshotPrimaryRelocations - delayed startup in
restart()
None of this looks wrong, I'm just uncomfortable with doing a big mechanical change plus some slightly fiddly manual changes all at once. I called out in inline comments some other bits I'd prefer to be done as preliminary steps too.
| SnapshotsInfoService snapshotsInfoService | ||
| ) { | ||
| super( | ||
| this( |
There was a problem hiding this comment.
I'd rather add this constructor parameter to all the callers (there's not very many) rather than delegating to another constructor here. Could be done as a preliminary step to reduce noise.
| throw new UnsupportedOperationException(); | ||
| } | ||
| protected DeterministicThreadPool(Function<Runnable, Runnable> runnableWrapper) { | ||
| this.runnableWrapper = runnable -> getThreadContext().preserveContext(runnableWrapper.apply(runnable)); |
There was a problem hiding this comment.
Makes sense, but again I'd slightly prefer this to be a preliminary PR just to make it easier to review with confidence that it's all properly tested.
| public ScheduledFuture<?> schedule(Runnable command, long delay, TimeUnit unit) { | ||
| // For GlobalCheckpointListeners#add |
There was a problem hiding this comment.
This deserves more testing I think (and thus likely a separate PR)
There was a problem hiding this comment.
Sure. I split changes for DeterministicQueue to its own PR #140151
The PR includes all 3 changes to the class. Please let me know if you'd like to further split it. Thanks!
| protected final Function<Runnable, Runnable> runnableWrapper; | ||
|
|
||
| @Override | ||
| public ScheduledFuture<?> scheduleAtFixedRate(Runnable command, long initialDelay, long period, TimeUnit unit) { | ||
| throw new UnsupportedOperationException(); | ||
| } | ||
| protected DeterministicThreadPool(Function<Runnable, Runnable> runnableWrapper) { |
There was a problem hiding this comment.
Nit: would prefer constructors and fields at the top of the class
There was a problem hiding this comment.
They are at the top of the class. It might be a bit hidden because the instance level initialization code for forkingExecutor. I can move the initialization into the constructor if it helps. I didn't do it to minimize the change set. Please let me know.
This PR makes MockAllocationService constructor take an explicit ShardRoutingRoleStrategy parameter so that it can be customized. Relates: elastic#138333
I raised a separate PR #140152 for just EDIT: For easy consumption, the 3 PRs splitted from this one are the follows: |
This PR makes MockAllocationService constructor take an explicit ShardRoutingRoleStrategy parameter so that it can be customized. Relates: #138333
This PR makes following changes to DeterministicTaskQueue 1. Promote the anonymous ThreadPool class to a named inner class, `DeterministicThreadPool`. 2. Fix threadContext preservation for all tasks 3. Add a basic implementation for `ScheduledExecutorService#schedule(Runnable, long, TimeUnit)`. The PR is split from #138333
This PR makes following changes to DeterministicTaskQueue 1. Promote the anonymous ThreadPool class to a named inner class, `DeterministicThreadPool`. 2. Fix threadContext preservation for all tasks 3. Add a basic implementation for `ScheduledExecutorService#schedule(Runnable, long, TimeUnit)`. The PR is split from elastic#138333
Refactor for `SnapshotResiliencyTests` to make it easier to extend. The main changes are (1) extract TestCluster related code into its own file and (2) add a separate init phase to the test node so that it can be customized by future subclasses. It also extracts serveral protected methods from existing code so they can be customized in future as well. Relates: #138333
…iency-tests-to-stateless
| return true; | ||
| }).sorted(Comparator.comparing(n -> n.node.getName())).toList(); | ||
| return dataNodes.isEmpty() ? Optional.empty() : Optional.ofNullable(randomFrom(dataNodes)); | ||
| return dataNodes.isEmpty() ? Optional.empty() : Optional.ofNullable(ESTestCase.randomFrom(dataNodes)); |
There was a problem hiding this comment.
nit: could we use qualified imports for these things?
There was a problem hiding this comment.
Yeah some merge conflicts didn't resolve as they should have. This was one of them. I have now cleaned up in abb4f3b Thanks!
Refactor for `SnapshotResiliencyTests` to make it easier to extend. The main changes are (1) extract TestCluster related code into its own file and (2) add a separate init phase to the test node so that it can be customized by future subclasses. It also extracts serveral protected methods from existing code so they can be customized in future as well. Relates: elastic#138333
…iency-tests-to-stateless
This a pure refactor for
SnapshotResiliencyTeststo make it possible to be extended on the stateless side. The main changes are (1) allowing subclasses for the test class and friends; (2) extracting TestCluster related code into its own file; (3) allowing predefined executors to run as part of DeterministicQueue.