Add reserved snapshot/repo action #89601

Merged
grcevski merged 14 commits into elastic:main from grcevski:operator/repo
Sep 12, 2022

Conversation

@grcevski
Contributor

This PR adds support for /_snapshot/repo file-based settings.

Pre-requisite for: #89567

Relates to #89183
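For context, the file-based settings mechanism reads a JSON document from the node's config directory and hands each top-level state section to a reserved state handler. A snapshot repo entry might look like the sketch below; the handler key (`snapshot_repositories`), the metadata layout, and the repository fields are assumptions modeled on the existing `cluster_settings` handler, not confirmed by this PR.

```json
{
  "metadata": {
    "version": "1",
    "compatibility": "8.4.0"
  },
  "state": {
    "snapshot_repositories": {
      "repo": {
        "type": "fs",
        "settings": {
          "location": "/mnt/backups/repo"
        }
      }
    }
  }
}
```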

@grcevski added the >enhancement, :Core/Infra/Core (Core issues without another label), Team:Core/Infra (Meta label for core/infra team), and v8.5.0 labels on Aug 24, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine
Collaborator

Hi @grcevski, I've created a changelog YAML for you.

return ClusterState.builder(currentState).metadata(mdBuilder).build();
}

submitUnbatchedTask("put_repository [" + request.name() + "]", new RegisterRepositoryTask(this, request, acknowledgementStep) {
Contributor Author

This refactoring simply moves the code so we can reuse the task execute method for the reserved state handler.

listener.onFailure(e);
return;
}
validateRepository(request, listener);
Contributor Author

The validation logic was extracted so we can use it in the reserved state handler.

Contributor

This seems broken; we should return here on any validation exception?
Maybe it's better to leave this a non-async method, do the try/catch here, and let the exception bubble up in the new spot this is used in, where the listener is somewhat artificial anyway?

Contributor Author

Ah good catch, I'll refactor that method!

Contributor Author

While fixing this I noticed something: the name validation is a synchronous failure, but creating the repository reports failure through the listener. I'm not sure there's a reason for the difference, but I've changed the code to mimic the original behaviour for now.
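As a rough illustration of the behaviour discussed in this thread (plain Java with made-up names, not the actual RepositoriesService code): name problems fail synchronously by throwing, while validation of the repository contents reports failure through the listener, including the early return the reviewer asked about.

```java
// Sketch only: `RepoRegistration`, `Listener`, and the name rules below are
// illustrative, not the Elasticsearch API.
class RepoRegistration {
    interface Listener {
        void onResponse(String ack);
        void onFailure(Exception e);
    }

    static void putRepository(String name, Runnable validateContents, Listener listener) {
        // Synchronous failure: a bad name throws straight back to the caller.
        if (name == null || name.isEmpty() || name.contains("#")) {
            throw new IllegalArgumentException("invalid repository name [" + name + "]");
        }
        // Asynchronous failure: content validation goes through the listener.
        try {
            validateContents.run();
        } catch (Exception e) {
            listener.onFailure(e);
            return; // the missing return the review comment above pointed out
        }
        listener.onResponse("acknowledged");
    }
}
```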

throw new RepositoryMissingException(request.name());
}

submitUnbatchedTask("delete_repository [" + request.name() + "]", new UnregisterRepositoryTask(request, listener) {
Contributor Author

Similar extraction of the execute logic into a standalone task so it can be reused.

*
* @return a collection of optional reserved state handler names
*/
default Collection<String> optionalDependencies() {
Contributor Author

SLM and ILM might depend on this snapshot repo configuration, but not always. This extends the handler interface to allow optional dependencies; they exist purely for ordering purposes, so we can create the repo before the SLM policy.
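The ordering idea can be sketched like this (illustrative names only, not the actual reserved state handler API): optional dependencies participate in ordering only when the named handler is actually present, while hard dependencies must exist.

```java
import java.util.*;

// Sketch: a handler declares hard and optional dependencies; ordering visits
// dependencies first, skipping optional ones that aren't registered.
interface ReservedHandler {
    String name();
    default Collection<String> dependencies() { return List.of(); }
    default Collection<String> optionalDependencies() { return List.of(); }
}

class HandlerOrdering {
    // Order handlers so each runs after its (present) dependencies.
    // Cycle detection is omitted to keep the sketch short.
    static List<String> order(Map<String, ReservedHandler> handlers) {
        List<String> ordered = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        for (String name : handlers.keySet()) {
            visit(name, handlers, visited, ordered);
        }
        return ordered;
    }

    private static void visit(String name, Map<String, ReservedHandler> handlers,
                              Set<String> visited, List<String> ordered) {
        if (visited.add(name) == false) {
            return;
        }
        ReservedHandler handler = handlers.get(name);
        for (String dep : handler.dependencies()) {
            if (handlers.containsKey(dep) == false) {
                throw new IllegalStateException("missing dependency: " + dep);
            }
            visit(dep, handlers, visited, ordered);
        }
        for (String dep : handler.optionalDependencies()) {
            if (handlers.containsKey(dep)) { // optional: skip if absent
                visit(dep, handlers, visited, ordered);
            }
        }
        ordered.add(name);
    }
}
```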


// package private for testing
- Path operatorSettingsDir() {
+ public Path operatorSettingsDir() {
Contributor Author

Making these public so that we can use them in tests outside of the package.

ordered.add(key);
}

/**
Contributor Author

The state handlers are initialized on creation of the service, which is hung off the ActionModule. The Node creation code has many service interdependencies and it's impossible to create all reserved state handlers ahead of time. This API allows us the flexibility to add other state handlers as we build the modules in Node.
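A minimal sketch of such an API (hypothetical names, not the actual Elasticsearch code): the service starts with whatever handlers the ActionModule knows about, and modules built later during Node construction register theirs before file processing begins.

```java
import java.util.*;

// Sketch of a service whose handler set stays open during Node construction.
class ReservedStateService {
    interface Handler {
        String name();
    }

    private final Map<String, Handler> handlers = new HashMap<>();

    ReservedStateService(Collection<Handler> initial) {
        for (Handler handler : initial) {
            handlers.put(handler.name(), handler);
        }
    }

    // Later-constructed modules (e.g. the snapshot repo module) register
    // their handlers here before the service starts processing files.
    void installStateHandler(Handler handler) {
        handlers.put(handler.name(), handler);
    }

    Set<String> handlerNames() {
        return handlers.keySet();
    }
}
```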

@original-brownbear self-requested a review August 25, 2022 11:11
Contributor

@original-brownbear left a comment

Gave it a quick read and it looks reasonable. I'll do a deeper review soon, but I think I found one issue that needs addressing in the meantime.

Contributor

@original-brownbear left a comment

Thanks Nikola, this looks alright to me now in terms of the cluster state updates and changes to the RepositoriesService.
But just generally speaking, shouldn't this have some end-to-end test in the form of an internal cluster test or REST test?


Map<String, ?> source = parser.map();

for (String name : source.keySet()) {
Contributor

Not too important, but this is a bit of a strange loop; why not loop over the entrySet and avoid the source.get(name)?
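The reviewer's suggestion would look roughly like this (standalone sketch, not the actual parsing code): iterating `entrySet()` yields key and value in one pass, avoiding a second hash lookup per key.

```java
import java.util.*;

class EntrySetLoop {
    // Build "name=value" strings from a parsed source map without calling
    // source.get(name) for every key.
    static List<String> describe(Map<String, ?> source) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, ?> entry : source.entrySet()) {
            out.add(entry.getKey() + "=" + entry.getValue());
        }
        return out;
    }
}
```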

@grcevski
Contributor Author

grcevski commented Sep 7, 2022

> Thanks Nikola, this looks alright to me now in terms of the cluster state updates and changes to the RepositoriesService.
> But just generally speaking, shouldn't this have some end-to-end test in the form of an internal cluster test or REST test?

Thanks Armin! Great point on the testing; I had a test when this work was combined with the SLM policies, but I forgot to bring over that part of the integration test. I'll follow up with an update now.

Nikola Grcevski added 2 commits September 7, 2022 20:07
The new test exposed a very rare bug where the
file settings service was in the middle of processing
the file when the node closed. This terminated the
cluster state update task, but nobody unlocked the
latch await. The fix allows the stop operation to
properly terminate the watcher thread.

Test added that exposes the bug.
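The shutdown bug and fix described in the commit message can be sketched as follows (illustrative names, not the actual FileSettingsService code): the watcher thread awaits a latch that is counted down when the cluster state update completes, so if the node stops mid-processing, stop() must also release the latch or the watcher thread never terminates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class FileWatcher {
    private final CountDownLatch processing = new CountDownLatch(1);
    private volatile boolean active = true;

    // Called when the cluster state update task finishes normally.
    void onClusterStateProcessed() {
        processing.countDown();
    }

    void stop() {
        active = false;
        // The fix: unblock any in-flight wait so the watcher thread exits.
        processing.countDown();
    }

    // Returns true only if processing actually completed before a stop.
    boolean awaitProcessing() throws InterruptedException {
        return processing.await(5, TimeUnit.SECONDS) && active;
    }
}
```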
@grcevski
Contributor Author

grcevski commented Sep 8, 2022


Good call on asking for the test, I exposed a bug in the FileSettingsService related to node shutdown and processing the settings file.

@grcevski
Contributor Author

grcevski commented Sep 8, 2022

There's still a race condition. I need to think a bit more how to prevent it.

There was one more race condition related to service
stop. The setting of the processing latch raced against
the stop method execution. The stop method may have
still seen the processing latch as null or another stale
instance. With this fix after we receive a processing
latch, we check if the watcher state is still valid, and
only then wait.

Another test added consistently exposing the timing hole.
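The second race described above can be sketched like this (illustrative names only): a fresh latch is installed for every file event, so stop() can observe null or a stale latch. The processing side therefore re-checks whether the watcher is still running after publishing its latch, and skips the await if stop() already ran; because both fields are volatile, at least one side must see the other's write.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class WatcherService {
    private volatile CountDownLatch processingLatch;
    private volatile boolean running = true;

    void stop() {
        running = false;
        CountDownLatch latch = processingLatch; // may be null or stale
        if (latch != null) {
            latch.countDown();
        }
    }

    // Called on the watcher thread for each file event. Returns true if the
    // latch was released before the timeout, false if we stopped early.
    boolean processFileEvent() throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        processingLatch = latch;
        // The fix: check validity only *after* the latch is visible to stop();
        // if stop() ran first, it either sees our latch or we see running=false.
        if (running == false) {
            return false;
        }
        return latch.await(5, TimeUnit.SECONDS);
    }
}
```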
@grcevski
Contributor Author

grcevski commented Sep 8, 2022

Hi @original-brownbear, I think I took care of the node shutdown bugs. I opened a separate issue, #89934, for merging those changes here. I think this is good to review again.

Contributor

@original-brownbear left a comment

LGTM just one thing to maybe fix in the tests if you have a sec :) Thanks for the iterations!

);

// This should succeed, nothing was reserved
client().execute(PutRepositoryAction.INSTANCE, sampleRestRequest("err-repo")).actionGet();
Contributor

Can we use just plain get()? actionGet() hides the exact point where the exception was thrown by unwrapping the execution exception, and unless it's needed for something like the expectThrows above, I'd rather avoid it because it makes debugging test failures a pain :)

final var reposResponse = client().execute(
GetRepositoriesAction.INSTANCE,
new GetRepositoriesRequest(new String[] { "repo", "repo1" })
).actionGet();
Contributor

Let's use get here.


private void assertMasterNode(Client client, String node) {
assertThat(
client.admin().cluster().prepareState().execute().actionGet().getState().nodes().getMasterNode().getName(),
Contributor

get :)

@grcevski merged commit bdc0539 into elastic:main on Sep 12, 2022
@grcevski deleted the operator/repo branch September 12, 2022 13:22
@grcevski
Contributor Author

Thanks Armin!
