Updating HealthService to use FetchHealthInfoCacheAction by masseyke · Pull Request #89947 · elastic/elasticsearch

masseyke · 2022-09-08T18:59:19Z

This PR updates HealthService to fetch HealthInfo data using FetchHealthInfoCacheAction (#89820), and to pass it to the HealthIndicatorServices' calculate method. This will allow HealthIndicatorServices to use any data from the health node's HealthInfoCache.

elasticsearchmachine · 2022-09-09T13:12:53Z

Pinging @elastic/es-data-management (Team:Data Management)

andreidan · 2022-09-12T09:19:32Z

Relates to #84811

gmarouli

Wow, the last part of the disk indicator!! How exciting, thank you for putting this together @masseyke. I did a quick pass and I had two comments that might have quite some impact, so after we make some progress on those I will do a more detailed pass. Let me know if I can help :).

gmarouli · 2022-09-12T09:57:38Z

server/src/main/java/org/elasticsearch/health/HealthService.java

+                        FetchHealthInfoCacheAction.INSTANCE,
+                        new FetchHealthInfoCacheAction.Request()
+                    );
+                    FetchHealthInfoCacheAction.Response response = responseActionFuture.actionGet();


Can you rework this to not use a blocking call? I think it should be possible. Let me know if I can help.

I think we want this to be a blocking call, right? What do you have in mind?

Sorry I misunderstood you. I've changed the code so that it no longer blocks the thread (but still blocks on responding to the user).
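The difference discussed in the exchange above can be sketched without any Elasticsearch classes. This is a minimal illustration, assuming `java.util.concurrent.CompletableFuture` as a stand-in for the client call and its listener; `NonBlockingFetchSketch` and its methods are hypothetical names, not code from this PR.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

class NonBlockingFetchSketch {
    // Stand-in for the remote fetch: completes on another thread with a node count.
    static CompletableFuture<Integer> fetchNodeCount() {
        return CompletableFuture.supplyAsync(() -> 3);
    }

    // Blocking style: the calling thread is parked in join() until the response arrives.
    static int nodeCountBlocking() {
        return fetchNodeCount().join();
    }

    // Non-blocking style: register callbacks and return immediately.
    static void nodeCountAsync(Consumer<Integer> onResponse, Consumer<Throwable> onFailure) {
        fetchNodeCount().whenComplete((count, error) -> {
            if (error != null) {
                onFailure.accept(error);
            } else {
                onResponse.accept(count);
            }
        });
    }

    public static void main(String[] args) {
        System.out.println("blocking: " + nodeCountBlocking());
        CompletableFuture<Integer> result = new CompletableFuture<>();
        nodeCountAsync(result::complete, result::completeExceptionally);
        // join() is used here only so the demo can print before the JVM exits.
        System.out.println("async: " + result.join());
    }
}
```

In the non-blocking variant the user still waits for the answer, but no thread is held while the fetch is in flight.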

server/src/main/java/org/elasticsearch/health/HealthService.java

andreidan

Thanks for working on this Keith

Left a comment

andreidan · 2022-09-12T10:57:47Z

server/src/main/java/org/elasticsearch/health/HealthService.java

+                        FetchHealthInfoCacheAction.INSTANCE,
+                        new FetchHealthInfoCacheAction.Request()
+                    );
+                    FetchHealthInfoCacheAction.Response response = responseActionFuture.actionGet();


A few things here:

- the "fetch health request" will not create a cancellable task (similar to https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/action/admin/indices/diskusage/AnalyzeDiskUsageShardRequest.java#L42 )
- we shouldn't be blocking the calling thread here - so let's use the async call (listener etc)
- we should add a timeout here as the response to the health API should be prompt (also the timeout will be used as an indication that we don't have health information)

For the first bullet point, this PR will address this: #90003.

I've made the change for the 2nd. The timeout is set (well currently not set) in TransportHealthNodeAction. Does 10s sound reasonable for all health node actions? Or do we need to make this configurable differently for different actions? I would think 10s would be fine because all of this data is fairly time-sensitive and all of the actions ought to be very fast.

These transport actions should be very quick. Let's go with a setting that defaults to 5s.
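The idea of treating a timeout as "no health information available" can be sketched as follows. This is only an illustration under stated assumptions: it uses `CompletableFuture.completeOnTimeout` from the JDK rather than the transport-level timeout the PR configures in TransportHealthNodeAction, and `FetchTimeoutSketch`, `EMPTY`, and `withTimeout` are hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

class FetchTimeoutSketch {
    // Hypothetical "no health info" value used when the health node does not answer in time.
    static final Map<String, String> EMPTY = Map.of();

    // If the fetch has not completed within timeoutMillis, complete it with EMPTY instead.
    static CompletableFuture<Map<String, String>> withTimeout(
        CompletableFuture<Map<String, String>> fetch,
        long timeoutMillis
    ) {
        return fetch.completeOnTimeout(EMPTY, timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        // A fetch that never completes falls back to the empty value after the timeout.
        CompletableFuture<Map<String, String>> neverCompletes = new CompletableFuture<>();
        System.out.println(withTimeout(neverCompletes, 50).join().isEmpty());

        // A prompt fetch keeps its own result.
        CompletableFuture<Map<String, String>> prompt = CompletableFuture.completedFuture(Map.of("node-1", "GREEN"));
        System.out.println(withTimeout(prompt, 50).join());
    }
}
```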

masseyke · 2022-09-12T21:47:45Z

@elasticmachine run elasticsearch-ci/part-3

andreidan

Thanks Keith, left a few comments

andreidan · 2022-09-13T14:15:22Z

server/src/internalClusterTest/java/org/elasticsearch/health/HealthServiceIT.java

+            ClusterState state = internalCluster().client()
+                .admin()
+                .cluster()
+                .prepareState()
+                .clear()
+                .setMetadata(true)
+                .setNodes(true)
+                .get()
+                .getState();
+            DiscoveryNode healthNode = HealthNode.findHealthNode(state);
+            assertNotNull(healthNode);
+            Map<String, DiskHealthInfo> healthInfoCache = internalCluster().getInstance(HealthInfoCache.class, healthNode.getName())
+                .getHealthInfo()
+                .diskInfoByNode();
+            assertThat(healthInfoCache.size(), equalTo(state.getNodes().getNodes().keySet().size()));


Maybe a bit of a nit, but should we use the API we now have to fetch the health info?

I can do that (I assume you're talking about the FetchHealthInfoCacheAction).

andreidan · 2022-09-13T14:21:38Z

server/src/main/java/org/elasticsearch/health/HealthService.java

+                            getHealthNoHealthInfo(listener, indicatorName, preflightResults, filteredIndicators.toList(), explain);
+                        }
+                    });
+                } catch (NodeNotConnectedException | HealthNodeNotDiscoveredException e) {


I believe the async calls don't throw these exceptions, right? They're propagated to the listener via onFailure.

Do we need special handling there?

I believe that this happens in client.execute before it does the async bit, but I'll double-check.

Looks like you're right -- the async calls don't throw these. I'll take this block out.
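The conclusion of this thread, that failures reach the caller through `onFailure` rather than as thrown exceptions, can be sketched with a small callback interface. This is a simplified, synchronous illustration; `Listener`, `fetch`, and `fetchOrFallback` are hypothetical and only mirror the onResponse/onFailure shape visible in the diff.

```java
import java.util.concurrent.atomic.AtomicReference;

class ListenerFailureSketch {
    // Minimal callback interface modeled on the onResponse/onFailure shape seen in the diff.
    interface Listener<T> {
        void onResponse(T response);

        void onFailure(Exception e);
    }

    // Hypothetical call: problems are reported to the listener, never thrown to the caller.
    static void fetch(boolean healthNodeKnown, Listener<String> listener) {
        if (healthNodeKnown == false) {
            listener.onFailure(new IllegalStateException("health node not discovered"));
            return;
        }
        listener.onResponse("health info");
    }

    // No try/catch is needed around fetch(); the fallback is chosen inside onFailure.
    static String fetchOrFallback(boolean healthNodeKnown) {
        AtomicReference<String> result = new AtomicReference<>();
        fetch(healthNodeKnown, new Listener<>() {
            @Override
            public void onResponse(String response) {
                result.set(response);
            }

            @Override
            public void onFailure(Exception e) {
                result.set("no health info");
            }
        });
        return result.get();
    }

    public static void main(String[] args) {
        System.out.println(fetchOrFallback(true));
        System.out.println(fetchOrFallback(false));
    }
}
```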

andreidan · 2022-09-13T14:21:53Z

server/src/main/java/org/elasticsearch/health/HealthService.java

+                        }
+                    });
+                } catch (NodeNotConnectedException | HealthNodeNotDiscoveredException e) {
+                    logger.info("Could not fetch data from health node", e);


If this catch clause is needed, I think this log statement should be debug.

This is needed (typically) when the master-is-stable indicator is reporting that there is a stable master node but the master itself is null (since we say the master is stable within the last 30 seconds). I can bump down the level.

andreidan · 2022-09-13T14:22:30Z

server/src/main/java/org/elasticsearch/health/HealthService.java

+
+                        @Override
+                        public void onFailure(Exception e) {
+                            getHealthNoHealthInfo(listener, indicatorName, preflightResults, filteredIndicators.toList(), explain);


This method name reads confusingly to me - get health no health?

Is this method needed? (It seems to differ from the combineResultsAndNotifyListener method only by hardcoding EMPTY_HEALTH_INFO.) Also, is it intentional that the indicators are calculated inside the method? (It's quite trappy IMO - especially in contrast to combineResultsAndNotifyListener.)

IMO this is quite a shallow method and we should drop it.

Duplicating a few method calls (instead of it) is fine.

What do you think?

Works for me. I was unhappy with the method name, too.

andreidan · 2022-09-13T14:25:43Z

server/src/main/java/org/elasticsearch/health/HealthService.java

+                        @Override
+                        public void onResponse(FetchHealthInfoCacheAction.Response response) {
+                            HealthInfo healthInfo = response.getHealthInfo();
+                            combineResultsAndNotifyListener(


This method signature is a bit confusing - it combines the results into something else? Should it return this combined result?
Also, why does it need one indicator name? Which one is that? Does this method do too much? (Hence the confusing signature.)

Ah on a closer read this method filters the results (based on indicatorName), combines them, and calls the listener.

Can we split it into a few methods? (execute listeners, transform result(s), notify listener)

See what you think of what I've done. It's hard to pull it all apart because the indicatorName is used for validation, and we need the listener when validation fails. I've separated out the filtering of preflight results by indicator name and the combining of results though.
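The separation described in this thread (filter by indicator name, combine results, notify the listener) might look roughly like the sketch below. It is not the code from the PR: results are plain strings, validation is omitted, and `SplitStepsSketch` and its method names are hypothetical. The indicator names are used for illustration only.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class SplitStepsSketch {
    // Step 1: keep only the results for the requested indicator (null means "all indicators").
    static List<String> filterByName(List<String> results, String indicatorName) {
        return results.stream()
            .filter(name -> indicatorName == null || name.equals(indicatorName))
            .collect(Collectors.toList());
    }

    // Step 2: combine preflight and non-preflight results, preflight first.
    static List<String> combine(List<String> preflight, List<String> others) {
        return Stream.concat(preflight.stream(), others.stream()).collect(Collectors.toList());
    }

    // Step 3: hand the combined list to the caller's callback.
    static void notifyListener(List<String> combined, Consumer<List<String>> listener) {
        listener.accept(combined);
    }

    public static void main(String[] args) {
        List<String> preflight = filterByName(List.of("master_is_stable"), null);
        List<String> others = filterByName(List.of("disk", "shards_availability"), null);
        notifyListener(combine(preflight, others), System.out::println);
    }
}
```

Keeping the first two steps free of the listener makes them easy to test in isolation; only the last step touches the callback.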

andreidan · 2022-09-13T14:32:08Z

server/src/main/java/org/elasticsearch/health/node/action/TransportHealthNodeAction.java

+                        actionName,
+                        request,
+                        task,
+                        TransportRequestOptions.timeout(TimeValue.timeValueSeconds(5)), // expected to be lightweight and time-sensitive


Shall we have this timeout in a setting that defaults to 5s? #89947 (comment)

andreidan · 2022-09-13T14:43:51Z

server/src/main/java/org/elasticsearch/health/HealthService.java

-    public List<HealthIndicatorResult> getHealth(@Nullable String indicatorName, boolean explain) {
+    public void getHealth(
+        Client client,
+        ActionListener<List<HealthIndicatorResult>> listener,


I believe for these APIs it's usually that we have the client first (to indicate a remote call), the business logic arguments, and last the listener

e.g. public static void closePointInTime(Client client, String pointInTimeId, ActionListener<Boolean> listener)

I'll change that. Didn't realize we had a convention.

andreidan · 2022-09-13T14:44:03Z

server/src/main/java/org/elasticsearch/health/HealthService.java

+     * @param filteredIndicatorResults The results of the non-preflight health indicators
+     */
+    private void combineResultsAndNotifyListener(
+        ActionListener<List<HealthIndicatorResult>> listener,


I believe for these APIs it's usually that we have the client first (to indicate a remote call), the business logic arguments, and last the listener

e.g. public static void closePointInTime(Client client, String pointInTimeId, ActionListener<Boolean> listener)

TIL, makes perfect sense.

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

gmarouli · 2022-09-14T08:31:03Z

server/src/main/java/org/elasticsearch/health/node/action/TransportHealthNodeAction.java

+        "health_node.transport_action_timeout",
+        TimeValue.timeValueSeconds(5),
+        TimeValue.timeValueMillis(1),
+        Setting.Property.NodeScope


I just want to test my understanding: this setting is not dynamically updatable, which is why we do not have a listener for it, right?

It is not dynamically updatable as I've written it. It didn't seem worth the effort to do that here since this one will probably never even be touched. What do you think?

Is there any general direction about this? I do not feel strongly about it. I am mainly thinking since this will also be called in the background regularly maybe someone would want to change it, but we can leave it as is if you think it's not worth the effort.

Let's please have it dynamic for ease of use (I think it's especially interesting to have it easily updateable while we are experimental and see how it behaves in deployments).

Something simple (a setter) like the one here would suffice: https://github.com/elastic/elasticsearch/blob/main/modules/ingest-geoip/src/main/java/org/elasticsearch/ingest/geoip/GeoIpDownloaderTaskExecutor.java#L80
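The "setter as settings consumer" pattern suggested in this thread can be sketched with a toy registry. Everything here is hypothetical and self-contained: `SettingsRegistry` stands in for a cluster settings service and is not the Elasticsearch API; only the 5s default is taken from the discussion.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class DynamicTimeoutSketch {
    // Hypothetical stand-in for a settings service that pushes updates to registered consumers.
    static class SettingsRegistry {
        private final List<Consumer<Duration>> consumers = new ArrayList<>();

        void addUpdateConsumer(Consumer<Duration> consumer) {
            consumers.add(consumer);
        }

        void update(Duration newValue) {
            consumers.forEach(consumer -> consumer.accept(newValue));
        }
    }

    // Hypothetical action that keeps its timeout in a volatile field updated through a setter.
    static class HealthNodeActionSketch {
        private volatile Duration timeout = Duration.ofSeconds(5); // default from the discussion

        HealthNodeActionSketch(SettingsRegistry registry) {
            // The setter is registered once; later updates take effect without a restart.
            registry.addUpdateConsumer(this::setTimeout);
        }

        void setTimeout(Duration timeout) {
            this.timeout = timeout;
        }

        Duration timeout() {
            return timeout;
        }
    }

    public static void main(String[] args) {
        SettingsRegistry registry = new SettingsRegistry();
        HealthNodeActionSketch action = new HealthNodeActionSketch(registry);
        System.out.println(action.timeout());
        registry.update(Duration.ofSeconds(2));
        System.out.println(action.timeout());
    }
}
```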

server/src/test/java/org/elasticsearch/health/HealthServiceTests.java

gmarouli

Very nice progress, thanks @masseyke, I added some minor comments.

andreidan

LGTM, thanks for iterating on this Keith

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

andreidan · 2022-09-14T08:50:00Z

server/src/main/java/org/elasticsearch/health/node/action/TransportHealthNodeAction.java

+        "health_node.transport_action_timeout",
+        TimeValue.timeValueSeconds(5),
+        TimeValue.timeValueMillis(1),
+        Setting.Property.NodeScope


Shall we also make this Setting.Property.Dynamic and add a setting consumer - a setter?

I think we'll want all our health settings to be Dynamic (but for 8.4 it might be tricky to change them as they're already in the cluster state). We can discuss this separately though.

gmarouli

LGTM :) Almost there! 🚀

Updating HealthService to use FetchHealthInfoCacheAction

8d7f4d0

masseyke added >non-issue :Distributed/Health Issues for the health report API v8.5.0 labels Sep 8, 2022

masseyke mentioned this pull request Sep 8, 2022

Disk Usage health indicator #84811

Closed

9 tasks

Adding integration test

70abe1e

masseyke requested review from andreidan and gmarouli September 8, 2022 19:26

masseyke marked this pull request as ready for review September 9, 2022 13:12

elasticsearchmachine added the Team:Data Management (obsolete) DO NOT USE. This team no longer exists. label Sep 9, 2022

gmarouli requested changes Sep 12, 2022

andreidan reviewed Sep 12, 2022

masseyke added 4 commits September 12, 2022 12:00

Merge branch 'main' into feature/health-api-health-service

e8b1ee1

Avoiding blocking thread waiting for health info results

0a5bb8c

fixing a dumb mistake

549cd48

Adding a timeout for TransportHealthNodeAction

526b13b

Reducing timeout from 10s to 5s

cd30549

masseyke requested review from andreidan and gmarouli September 13, 2022 13:23

andreidan reviewed Sep 13, 2022

masseyke added 4 commits September 13, 2022 11:01

code review feedback

e3917d5

code review feedback

032e386

code review feedback

cd2ab4e

removing unreachable code

3e43925

masseyke requested a review from andreidan September 13, 2022 18:06

masseyke mentioned this pull request Sep 13, 2022

Adding DiskHealthIndicatorService #90041

Merged

gmarouli reviewed Sep 14, 2022

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

server/src/test/java/org/elasticsearch/health/HealthServiceTests.java

gmarouli self-requested a review September 14, 2022 09:20

andreidan approved these changes Sep 14, 2022

code review feedback

f38a6c8

gmarouli approved these changes Sep 14, 2022

code review feedback

49a7f11

masseyke added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 14, 2022

masseyke added 2 commits September 14, 2022 13:25

fixing integration test

ffd6c9a

explaining fix to integration test

ad2a9f9

elasticsearchmachine merged commit 9cbdc2a into elastic:main Sep 14, 2022

masseyke deleted the feature/health-api-health-service branch September 14, 2022 19:13
