Add reserved snapshot/repo action #89601
Conversation
Pinging @elastic/es-core-infra (Team:Core/Infra)
Hi @grcevski, I've created a changelog YAML for you. |
    return ClusterState.builder(currentState).metadata(mdBuilder).build();
}

submitUnbatchedTask("put_repository [" + request.name() + "]", new RegisterRepositoryTask(this, request, acknowledgementStep) {
This refactoring simply moves the code so we can reuse the task execute method for the reserved state handler.
    listener.onFailure(e);
    return;
}
validateRepository(request, listener);

The validation logic was extracted so we can use it in the reserved state handler.

This seems broken, we should return here on any validation exception?
Maybe better to leave this as a non-async method and do the try/catch here, letting the exception bubble up in the new spot this is used in, where the listener is somewhat artificial anyway?

Ah, good catch, I'll refactor that method!

While fixing this I noticed something: the name validation is a sync failure, but creating the repository uses the listener. I'm not sure there's a reason for this difference, but I've changed the code to mimic the original behaviour for now.
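The mixed failure modes described here can be sketched with minimal stand-ins (the `Listener` interface and method names below are hypothetical, not the actual Elasticsearch classes): name validation throws synchronously to the caller, while creation errors are routed through the listener.

```java
// Minimal stand-in for the ActionListener pattern (hypothetical names).
interface Listener<T> {
    void onResponse(T response);
    void onFailure(Exception e);
}

class ValidationSketch {
    // Name validation is a synchronous failure: it throws directly.
    static void validateName(String name) {
        if (name.contains("#")) {
            throw new IllegalArgumentException("invalid repository name: " + name);
        }
    }

    // Repository creation is asynchronous: failures go through the listener.
    static void registerRepository(String name, Listener<String> listener) {
        validateName(name); // sync: bubbles up to the caller, bypassing the listener
        try {
            // ... submit the cluster state update here ...
            listener.onResponse("registered " + name);
        } catch (Exception e) {
            listener.onFailure(e); // async: creation errors are forwarded
        }
    }
}
```

A caller that only supplies an artificial listener (as in the reserved state handler) still sees the validation exception at the call site, which is the behaviour the refactor preserves.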
    throw new RepositoryMissingException(request.name());
}

submitUnbatchedTask("delete_repository [" + request.name() + "]", new UnregisterRepositoryTask(request, listener) {

Similar extraction of the execute logic in a standalone task to be reused.
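The extraction pattern can be sketched like this (all names below are hypothetical toy stand-ins, not the real cluster state task classes): the state transformation lives in a named task class, so both the request path (which submits the task) and the reserved state handler (which only needs the transformation) can reuse `execute()`.

```java
// Toy model of a cluster state update task whose execute() is reusable.
abstract class ClusterTask {
    abstract String execute(String currentState);
}

// Named (rather than anonymous) so the reserved state handler can
// instantiate it directly and call execute() without submitting a task.
class UnregisterTask extends ClusterTask {
    final String repoName;

    UnregisterTask(String repoName) {
        this.repoName = repoName;
    }

    @Override
    String execute(String currentState) {
        // remove the repository entry from the (toy) state representation
        return currentState.replace("[" + repoName + "]", "");
    }
}
```

The request path wraps an `UnregisterTask` subclass with listener plumbing, while the handler calls `new UnregisterTask("repo").execute(state)` directly.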
 *
 * @return a collection of optional reserved state handler names
 */
default Collection<String> optionalDependencies() {

SLM and ILM might depend on this snapshot repo configuration, but not always. This extends the handler interface to allow optional dependencies; they are here purely for ordering purposes, so we can create the repo before the SLM policy.
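A sketch of the distinction (simplified, hypothetical interface and handler names; the real interface lives in the reserved state infrastructure): optional dependencies only influence ordering, so a handler named there runs first if present, but its absence is not an error, unlike a hard dependency.

```java
import java.util.Collection;
import java.util.List;

// Simplified sketch of the handler interface extension.
interface ReservedHandler {
    String name();

    // hard dependencies: the named handlers must exist
    default Collection<String> dependencies() {
        return List.of();
    }

    // soft dependencies: ordering hints only, absence is fine
    default Collection<String> optionalDependencies() {
        return List.of();
    }
}

class SlmPolicyHandler implements ReservedHandler {
    public String name() {
        return "slm";
    }

    // ensure snapshot repositories are created before SLM policies
    public Collection<String> optionalDependencies() {
        return List.of("snapshot_repositories");
    }
}
```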
// package private for testing
- Path operatorSettingsDir() {
+ public Path operatorSettingsDir() {

Making these public so that we can use them in tests outside of the package.
    ordered.add(key);
}

/**

The state handlers are initialized on creation of the service, which hangs off the ActionModule. The Node creation code has many service interdependencies, and it's impossible to create all reserved state handlers ahead of time. This API gives us the flexibility to add further state handlers as we build the modules in Node.
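The shape of that API can be sketched as follows (hypothetical names, not the real service class): the service is constructed early off the ActionModule with an initially empty handler registry, and handlers are installed later as the services they depend on come up during Node construction.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Handler contract, reduced to a name for this sketch.
interface ClusterHandler {
    String name();
}

// Toy model of late handler registration.
class ReservedStateServiceSketch {
    private final Map<String, ClusterHandler> handlers = new LinkedHashMap<>();

    // called from Node as each module that owns a handler is built
    void installHandler(ClusterHandler handler) {
        handlers.put(handler.name(), handler);
    }

    boolean knowsHandler(String name) {
        return handlers.containsKey(name);
    }
}
```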
original-brownbear
left a comment
Gave it a quick read and it looks reasonable. Will do a deeper review soon, but I think I found one issue that needs addressing in the meantime.
original-brownbear
left a comment
Thanks Nikola, this looks alright to me now in terms of the cluster state updates and changes to the RepositoriesService.
But just generally speaking, shouldn't this have some end-to-end test in the form of an internal cluster test or REST test?
Map<String, ?> source = parser.map();

for (String name : source.keySet()) {

Not too important, but this is a bit of a strange loop; why not loop over entrySet() and avoid the source.get(name)?
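The two loop shapes being compared can be illustrated in plain Java (method names here are made up for the example): iterating `keySet()` forces a second lookup per key via `source.get(name)`, while iterating `entrySet()` reads the key and value in one pass.

```java
import java.util.Map;

class LoopSketch {
    static int countNonNullViaKeySet(Map<String, ?> source) {
        int n = 0;
        for (String name : source.keySet()) {
            if (source.get(name) != null) { // extra lookup on every iteration
                n++;
            }
        }
        return n;
    }

    static int countNonNullViaEntrySet(Map<String, ?> source) {
        int n = 0;
        for (Map.Entry<String, ?> entry : source.entrySet()) {
            if (entry.getValue() != null) { // no second lookup needed
                n++;
            }
        }
        return n;
    }
}
```

Both produce the same result; the `entrySet()` form just avoids the redundant hash lookup per iteration.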
Thanks Armin! Great point on the testing. I had a test when this work was together with the SLM policies, but I forgot to bring over that part of the integration test. I'll follow up with an update now.

The new test exposed a very rare bug where the file settings service was in the middle of processing the file when the node closed. This terminated the cluster state update task, but nobody unlocked the latch await. The fix allows the stop operation to properly terminate the watcher thread. A test was added that exposes the bug.

Good call on asking for the test: it exposed a bug in the FileSettingsService related to node shutdown while processing the settings file.

There's still a race condition. I need to think a bit more about how to prevent it.

There was one more race condition related to service stop: the setting of the processing latch raced against the stop method execution, so the stop method could still see the processing latch as null or as another, stale instance. With this fix, after we receive a processing latch we check that the watcher state is still valid, and only then wait. Another test was added that consistently exposes the timing hole.
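The two shutdown fixes described above can be sketched in simplified form (hypothetical class and field names, not the real FileSettingsService): stop() counts down any in-flight processing latch so waiters don't hang forever, and a reader re-checks that the watcher is still active after reading the latch reference, since that reference may be null or stale when it races with stop().

```java
import java.util.concurrent.CountDownLatch;

class FileWatcherSketch {
    private volatile CountDownLatch processing;
    private volatile boolean active = true;

    // watcher thread begins a processing cycle
    void startProcessing() {
        processing = new CountDownLatch(1);
    }

    // watcher thread completes a processing cycle
    void finishProcessing() {
        CountDownLatch latch = processing;
        if (latch != null) latch.countDown();
    }

    void stop() {
        active = false;
        // unlock anyone awaiting an in-flight processing cycle
        CountDownLatch latch = processing;
        if (latch != null) latch.countDown();
    }

    // returns true only if a live processing cycle was awaited
    boolean awaitProcessing() {
        CountDownLatch latch = processing;
        // re-check state after reading the latch: stop() may have won the race,
        // or the latch reference may be stale
        if (latch == null || active == false) {
            return false;
        }
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return true;
    }
}
```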
Hi @original-brownbear, I think I took care of the node shutdown bugs. I opened a separate issue to merge those changes here: #89934. I think this is good to review again.
original-brownbear
left a comment
LGTM just one thing to maybe fix in the tests if you have a sec :) Thanks for the iterations!
);

// This should succeed, nothing was reserved
client().execute(PutRepositoryAction.INSTANCE, sampleRestRequest("err-repo")).actionGet();
Can we use just plain get()? actionGet() hides the exact point where the exception was thrown by unwrapping the execution exception, and unless it's needed for something like expectThrows above, I'd rather avoid it because it makes debugging test failures a pain :)
final var reposResponse = client().execute(
    GetRepositoriesAction.INSTANCE,
    new GetRepositoriesRequest(new String[] { "repo", "repo1" })
).actionGet();

Let's use get() here.
private void assertMasterNode(Client client, String node) {
    assertThat(
        client.admin().cluster().prepareState().execute().actionGet().getState().nodes().getMasterNode().getName(),
Thanks Armin!
This PR adds support for /_snapshot/repo file-based settings.
Pre-requisite for: #89567
Relates to #89183