Add reserved snapshot/repo action #89601

Merged
grcevski merged 14 commits into elastic:main from grcevski:operator/repo
Sep 12, 2022

Conversation

@grcevski
Contributor

This PR adds support for /_snapshot/repo file-based settings.

Pre-requisite for: #89567

Relates to #89183
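For context, the file-based settings mechanism reads a JSON document from the node's config directory and hands each top-level state section to a reserved state handler. A snapshot repo entry might look like the sketch below; the handler key (`snapshot_repositories`), the metadata layout, and the repository fields are assumptions modeled on the existing `cluster_settings` handler, not confirmed by this PR.

```json
{
  "metadata": {
    "version": "1",
    "compatibility": "8.4.0"
  },
  "state": {
    "snapshot_repositories": {
      "repo": {
        "type": "fs",
        "settings": {
          "location": "/mnt/backups/repo"
        }
      }
    }
  }
}
```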

@grcevski added the >enhancement, :Core/Infra/Core (Core issues without another label), Team:Core/Infra (Meta label for core/infra team), and v8.5.0 labels on Aug 24, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine
Collaborator

Hi @grcevski, I've created a changelog YAML for you.

return ClusterState.builder(currentState).metadata(mdBuilder).build();
}

submitUnbatchedTask("put_repository [" + request.name() + "]", new RegisterRepositoryTask(this, request, acknowledgementStep) {
Contributor Author

This refactoring simply moves the code so we can reuse the task execute method for the reserved state handler.

listener.onFailure(e);
return;
}
validateRepository(request, listener);
Contributor Author

The validation logic was extracted so we can use it in the reserved state handler.

Contributor

This seems broken; we should return here on any validation exception?
Maybe it's better to leave this a non-async method, do the try/catch here, and let the exception bubble up in the new spot this is used in, where the listener is somewhat artificial anyway?

Contributor Author

Ah good catch, I'll refactor that method!

Contributor Author

While fixing this I noticed something: the name validation is a synchronous failure, but creating the repository reports failure through the listener. I'm not sure there's a reason for the difference, but I've changed the code to mimic the original behaviour for now.
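As a rough illustration of the behaviour discussed in this thread (plain Java with made-up names, not the actual RepositoriesService code): name problems fail synchronously by throwing, while validation of the repository contents reports failure through the listener, including the early return the reviewer asked about.

```java
// Sketch only: `RepoRegistration`, `Listener`, and the name rules below are
// illustrative, not the Elasticsearch API.
class RepoRegistration {
    interface Listener {
        void onResponse(String ack);
        void onFailure(Exception e);
    }

    static void putRepository(String name, Runnable validateContents, Listener listener) {
        // Synchronous failure: a bad name throws straight back to the caller.
        if (name == null || name.isEmpty() || name.contains("#")) {
            throw new IllegalArgumentException("invalid repository name [" + name + "]");
        }
        // Asynchronous failure: content validation goes through the listener.
        try {
            validateContents.run();
        } catch (Exception e) {
            listener.onFailure(e);
            return; // the missing return the review comment above pointed out
        }
        listener.onResponse("acknowledged");
    }
}
```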

throw new RepositoryMissingException(request.name());
}

submitUnbatchedTask("delete_repository [" + request.name() + "]", new UnregisterRepositoryTask(request, listener) {
Contributor Author

Similar extraction of the execute logic into a standalone task so it can be reused.

*
* @return a collection of optional reserved state handler names
*/
default Collection<String> optionalDependencies() {
Contributor Author

SLM and ILM might depend on this snapshot repo configuration, but not always. This extends the handler interface to allow optional dependencies; they exist purely for ordering purposes, so we can create the repo before the SLM policy.
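The ordering idea can be sketched like this (illustrative names only, not the actual reserved state handler API): optional dependencies participate in ordering only when the named handler is actually present, while hard dependencies must exist.

```java
import java.util.*;

// Sketch: a handler declares hard and optional dependencies; ordering visits
// dependencies first, skipping optional ones that aren't registered.
interface ReservedHandler {
    String name();
    default Collection<String> dependencies() { return List.of(); }
    default Collection<String> optionalDependencies() { return List.of(); }
}

class HandlerOrdering {
    // Order handlers so each runs after its (present) dependencies.
    // Cycle detection is omitted to keep the sketch short.
    static List<String> order(Map<String, ReservedHandler> handlers) {
        List<String> ordered = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        for (String name : handlers.keySet()) {
            visit(name, handlers, visited, ordered);
        }
        return ordered;
    }

    private static void visit(String name, Map<String, ReservedHandler> handlers,
                              Set<String> visited, List<String> ordered) {
        if (visited.add(name) == false) {
            return;
        }
        ReservedHandler handler = handlers.get(name);
        for (String dep : handler.dependencies()) {
            if (handlers.containsKey(dep) == false) {
                throw new IllegalStateException("missing dependency: " + dep);
            }
            visit(dep, handlers, visited, ordered);
        }
        for (String dep : handler.optionalDependencies()) {
            if (handlers.containsKey(dep)) { // optional: skip if absent
                visit(dep, handlers, visited, ordered);
            }
        }
        ordered.add(name);
    }
}
```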


// package private for testing
- Path operatorSettingsDir() {
+ public Path operatorSettingsDir() {
Contributor Author

Making these public so that we can use them in tests outside of the package.

ordered.add(key);
}

/**
Contributor Author

The state handlers are initialized on creation of the service, which is hung off the ActionModule. The Node creation code has many service interdependencies and it's impossible to create all reserved state handlers ahead of time. This API allows us the flexibility to add other state handlers as we build the modules in Node.
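A minimal sketch of such an API (hypothetical names, not the actual Elasticsearch code): the service starts with whatever handlers the ActionModule knows about, and modules built later during Node construction register theirs before file processing begins.

```java
import java.util.*;

// Sketch of a service whose handler set stays open during Node construction.
class ReservedStateService {
    interface Handler {
        String name();
    }

    private final Map<String, Handler> handlers = new HashMap<>();

    ReservedStateService(Collection<Handler> initial) {
        for (Handler handler : initial) {
            handlers.put(handler.name(), handler);
        }
    }

    // Later-constructed modules (e.g. the snapshot repo module) register
    // their handlers here before the service starts processing files.
    void installStateHandler(Handler handler) {
        handlers.put(handler.name(), handler);
    }

    Set<String> handlerNames() {
        return handlers.keySet();
    }
}
```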

@original-brownbear self-requested a review August 25, 2022 11:11
Contributor

@original-brownbear left a comment

Gave it a quick read and it looks reasonable. I'll do a deeper review soon, but I think I found one issue that needs addressing in the meantime.

Contributor

@original-brownbear left a comment

Thanks Nikola, this looks alright to me now in terms of the cluster state updates and changes to the RepositoriesService.
But just generally speaking, shouldn't this have some end-to-end test in the form of an internal cluster test or REST test?


Map<String, ?> source = parser.map();

for (String name : source.keySet()) {
Contributor

Not too important, but this is a bit of a strange loop; why not loop over the entrySet and avoid the source.get(name)?
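The reviewer's suggestion would look roughly like this (standalone sketch, not the actual parsing code): iterating `entrySet()` yields key and value in one pass, avoiding a second hash lookup per key.

```java
import java.util.*;

class EntrySetLoop {
    // Build "name=value" strings from a parsed source map without calling
    // source.get(name) for every key.
    static List<String> describe(Map<String, ?> source) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, ?> entry : source.entrySet()) {
            out.add(entry.getKey() + "=" + entry.getValue());
        }
        return out;
    }
}
```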

@grcevski
Contributor Author

grcevski commented Sep 7, 2022

> Thanks Nikola, this looks alright to me now in terms of the cluster state updates and changes to the RepositoriesService.
> But just generally speaking, shouldn't this have some end-to-end test in the form of an internal cluster test or REST test?

Thanks Armin! Great point on the testing; I had a test when this work was combined with the SLM policies, but I forgot to bring over that part of the integration test. I'll follow up with an update now.

Nikola Grcevski added 2 commits September 7, 2022 20:07
The new test exposed a very rare bug where the
file settings service was in the middle of processing
the file when the node closed. This terminated the
cluster state update task, but nobody unlocked the
latch await. The fix allows the stop operation to
properly terminate the watcher thread.

Test added that exposes the bug.
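The shutdown bug and fix described in the commit message can be sketched as follows (illustrative names, not the actual FileSettingsService code): the watcher thread awaits a latch that is counted down when the cluster state update completes, so if the node stops mid-processing, stop() must also release the latch or the watcher thread never terminates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class FileWatcher {
    private final CountDownLatch processing = new CountDownLatch(1);
    private volatile boolean active = true;

    // Called when the cluster state update task finishes normally.
    void onClusterStateProcessed() {
        processing.countDown();
    }

    void stop() {
        active = false;
        // The fix: unblock any in-flight wait so the watcher thread exits.
        processing.countDown();
    }

    // Returns true only if processing actually completed before a stop.
    boolean awaitProcessing() throws InterruptedException {
        return processing.await(5, TimeUnit.SECONDS) && active;
    }
}
```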
@grcevski
Contributor Author

grcevski commented Sep 8, 2022


Good call on asking for the test, I exposed a bug in the FileSettingsService related to node shutdown and processing the settings file.

@grcevski
Contributor Author

grcevski commented Sep 8, 2022

There's still a race condition. I need to think a bit more how to prevent it.

There was one more race condition related to service
stop. The setting of the processing latch raced against
the stop method execution. The stop method may have
still seen the processing latch as null or another stale
instance. With this fix after we receive a processing
latch, we check if the watcher state is still valid, and
only then wait.

Another test added consistently exposing the timing hole.
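The second race described above can be sketched like this (illustrative names only): a fresh latch is installed for every file event, so stop() can observe null or a stale latch. The processing side therefore re-checks whether the watcher is still running after publishing its latch, and skips the await if stop() already ran; because both fields are volatile, at least one side must see the other's write.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class WatcherService {
    private volatile CountDownLatch processingLatch;
    private volatile boolean running = true;

    void stop() {
        running = false;
        CountDownLatch latch = processingLatch; // may be null or stale
        if (latch != null) {
            latch.countDown();
        }
    }

    // Called on the watcher thread for each file event. Returns true if the
    // latch was released before the timeout, false if we stopped early.
    boolean processFileEvent() throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        processingLatch = latch;
        // The fix: check validity only *after* the latch is visible to stop();
        // if stop() ran first, it either sees our latch or we see running=false.
        if (running == false) {
            return false;
        }
        return latch.await(5, TimeUnit.SECONDS);
    }
}
```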
@grcevski
Contributor Author

grcevski commented Sep 8, 2022

Hi @original-brownbear, I think I took care of the node shutdown bugs. I opened a separate issue, #89934, for merging those changes here. I think this is good to review again.

Contributor

@original-brownbear left a comment

LGTM just one thing to maybe fix in the tests if you have a sec :) Thanks for the iterations!

);

// This should succeed, nothing was reserved
client().execute(PutRepositoryAction.INSTANCE, sampleRestRequest("err-repo")).actionGet();
Contributor

Can we use just plain get()? actionGet() hides the exact point where the exception was thrown by unwrapping the execution exception, and unless it's needed for something like the expectThrows above, I'd rather avoid it because it makes debugging test failures a pain :)

final var reposResponse = client().execute(
GetRepositoriesAction.INSTANCE,
new GetRepositoriesRequest(new String[] { "repo", "repo1" })
).actionGet();
Contributor

Let's use get here.


private void assertMasterNode(Client client, String node) {
assertThat(
client.admin().cluster().prepareState().execute().actionGet().getState().nodes().getMasterNode().getName(),
Contributor

get :)

@grcevski merged commit bdc0539 into elastic:main on Sep 12, 2022
@grcevski deleted the operator/repo branch September 12, 2022 13:22
@grcevski
Contributor Author

Thanks Armin!
