Updating HealthService to use FetchHealthInfoCacheAction#89947
Updating HealthService to use FetchHealthInfoCacheAction#89947elasticsearchmachine merged 15 commits intoelastic:mainfrom
Conversation
|
Pinging @elastic/es-data-management (Team:Data Management) |
|
Relates to #84811 |
gmarouli
left a comment
There was a problem hiding this comment.
Wow the last part of the disk indicator!! How exciting, thank you for putting this together @masseyke . I did quick pass and I had two comments that might have quite some impact so after we make some progress on those I will do a more detailed pass. Let me know if I can help :).
| FetchHealthInfoCacheAction.INSTANCE, | ||
| new FetchHealthInfoCacheAction.Request() | ||
| ); | ||
| FetchHealthInfoCacheAction.Response response = responseActionFuture.actionGet(); |
There was a problem hiding this comment.
Can you rework this to not use a blocking call? I think it should be possible. Let me know if I can help.
There was a problem hiding this comment.
I think we want this to be a blocking call right? What do you have in mind?
There was a problem hiding this comment.
Sorry I misunderstood you. I've changed the code so that it no longer blocks the thread (but still blocks on responding to the user).
andreidan
left a comment
There was a problem hiding this comment.
Thanks for working on this Keith
Left a comment
| FetchHealthInfoCacheAction.INSTANCE, | ||
| new FetchHealthInfoCacheAction.Request() | ||
| ); | ||
| FetchHealthInfoCacheAction.Response response = responseActionFuture.actionGet(); |
There was a problem hiding this comment.
A few things here:
- the "fetch health request" will not create a cancellable task (similar to https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/action/admin/indices/diskusage/AnalyzeDiskUsageShardRequest.java#L42 )
- we shouldn't be blocking the calling thread here - so let's use the async call (listener etc)
- we should add a timeout here as the response to the health API should be prompt (also the timeout will be used as an indication that we don't have health information)
There was a problem hiding this comment.
For the first bullet point, this PR will address this: #90003.
There was a problem hiding this comment.
I've made the change for the 2nd. The timeout is set (well currently not set) in TransportHealthNodeAction. Does 10s sound reasonable for all health node actions? Or do we need to make this configurable differently for different actions? I would think 10s would be fine because all of this data is fairly time-sensitive and all of the actions ought to be very fast.
There was a problem hiding this comment.
These transport actions should be very quick. Let's go with a setting that defaults to 5s
|
@elasticmachine run elasticsearch-ci/part-3 |
andreidan
left a comment
There was a problem hiding this comment.
Thanks Keith, left a few comments
| ClusterState state = internalCluster().client() | ||
| .admin() | ||
| .cluster() | ||
| .prepareState() | ||
| .clear() | ||
| .setMetadata(true) | ||
| .setNodes(true) | ||
| .get() | ||
| .getState(); | ||
| DiscoveryNode healthNode = HealthNode.findHealthNode(state); | ||
| assertNotNull(healthNode); | ||
| Map<String, DiskHealthInfo> healthInfoCache = internalCluster().getInstance(HealthInfoCache.class, healthNode.getName()) | ||
| .getHealthInfo() | ||
| .diskInfoByNode(); | ||
| assertThat(healthInfoCache.size(), equalTo(state.getNodes().getNodes().keySet().size())); |
There was a problem hiding this comment.
maybe a bit of a nit but should we use the API we now have to fetch the health info?
There was a problem hiding this comment.
I can do that (I assume you're talking about the FetchHealthInfoCacheAction).
| getHealthNoHealthInfo(listener, indicatorName, preflightResults, filteredIndicators.toList(), explain); | ||
| } | ||
| }); | ||
| } catch (NodeNotConnectedException | HealthNodeNotDiscoveredException e) { |
There was a problem hiding this comment.
I believe the async calls don't throw these exceptions right? They're propagated to the listener via onFailure
Do we need special handling there?
There was a problem hiding this comment.
I believe that this happens in client.execute before it does the async bit but i'll double-check.
There was a problem hiding this comment.
Looks like you're right -- the async calls don't throw these. I'll take this block out.
| } | ||
| }); | ||
| } catch (NodeNotConnectedException | HealthNodeNotDiscoveredException e) { | ||
| logger.info("Could not fetch data from health node", e); |
There was a problem hiding this comment.
If this catch clause is needed I think this log statement should be debug
There was a problem hiding this comment.
This is needed (typically) when the master-is-stable is reporting that there is a stable master node but the master itself is null (since we say the master is stable within the last 30 seconds). I can bump down the level.
|
|
||
| @Override | ||
| public void onFailure(Exception e) { | ||
| getHealthNoHealthInfo(listener, indicatorName, preflightResults, filteredIndicators.toList(), explain); |
There was a problem hiding this comment.
this method reads confusing to me - get health no health?
is this method needed? (it seems to differ from the combineResultsAndNotifyListener method by hardcoding EMPTY_HEALTH_INFO). Also is it intentional that the indicators are calculated iniside the method (it's quite trappy IMO - especially in contrast to combineResultsAndNotifyListener )
IMO this is quite a shallow method and we should drop it.
Duplicating a few method calls (instead of it) is fine.
What do you think?
There was a problem hiding this comment.
Works for me. I was unhappy with the method name, too.
| @Override | ||
| public void onResponse(FetchHealthInfoCacheAction.Response response) { | ||
| HealthInfo healthInfo = response.getHealthInfo(); | ||
| combineResultsAndNotifyListener( |
There was a problem hiding this comment.
this method signature is a bit confusing - combines the results into something else? should it return this combined result?
also, why does it need one indicator name? which one is that? Does this method do too much? (hence the confusing signature)
There was a problem hiding this comment.
Ah on a closer read this method filters the results (based on indicatorName), combines them, and calls the listener.
Can we split it into a few methods ? (execute listeners, transform result(s), notify listener)
There was a problem hiding this comment.
See what you think of what I've done. It's hard to pull it all apart because the indicatorName is used for validation, and we need the listener when validation fails. I've separated out the filtering of preflight results by indicator name and the combining of results though.
| actionName, | ||
| request, | ||
| task, | ||
| TransportRequestOptions.timeout(TimeValue.timeValueSeconds(5)), // expected to be lightweight and time-sensitive |
There was a problem hiding this comment.
Shall we have this timeout in a setting that defaults to 5s? #89947 (comment)
| public List<HealthIndicatorResult> getHealth(@Nullable String indicatorName, boolean explain) { | ||
| public void getHealth( | ||
| Client client, | ||
| ActionListener<List<HealthIndicatorResult>> listener, |
There was a problem hiding this comment.
I believe for these APIs it's usually that we have the client first (to indicate a remote call), the business logic arguments, and last the listener
e.g. public static void closePointInTime(Client client, String pointInTimeId, ActionListener<Boolean> listener)
There was a problem hiding this comment.
I'll change that. Didn't realize we had a convention.
| * @param filteredIndicatorResults The results of the non-preflight health indicators | ||
| */ | ||
| private void combineResultsAndNotifyListener( | ||
| ActionListener<List<HealthIndicatorResult>> listener, |
There was a problem hiding this comment.
I believe for these APIs it's usually that we have the client first (to indicate a remote call), the business logic arguments, and last the listener
e.g. public static void closePointInTime(Client client, String pointInTimeId, ActionListener<Boolean> listener)
There was a problem hiding this comment.
TIL, makes perfect sense.
...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java
Show resolved
Hide resolved
| "health_node.transport_action_timeout", | ||
| TimeValue.timeValueSeconds(5), | ||
| TimeValue.timeValueMillis(1), | ||
| Setting.Property.NodeScope |
There was a problem hiding this comment.
I just want to test my understanding, this setting is not dynamically updated that's why we do not have a listener for this, right?
There was a problem hiding this comment.
It is not dynamically updatable as I've written it. It didn't seem worth the effort to do that here since this one will probably never even be touched. What do you think?
There was a problem hiding this comment.
Is there any general direction about this? I do not feel strongly about it. I am mainly thinking since this will also be called in the background regularly maybe someone would want to change it, but we can leave it as is if you think it's not worth the effort.
There was a problem hiding this comment.
Let's please have it dynamic for ease of use (I think it's especially interesting to have it easily updateable while we are experimental and see how it behaves in deployments).
Something simple (a setter) like here would suffice https://github.com/elastic/elasticsearch/blob/main/modules/ingest-geoip/src/main/java/org/elasticsearch/ingest/geoip/GeoIpDownloaderTaskExecutor.java#L80
server/src/test/java/org/elasticsearch/health/HealthServiceTests.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/health/HealthServiceTests.java
Outdated
Show resolved
Hide resolved
andreidan
left a comment
There was a problem hiding this comment.
LGTM, thanks for iterating on this Keith
...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java
Show resolved
Hide resolved
| "health_node.transport_action_timeout", | ||
| TimeValue.timeValueSeconds(5), | ||
| TimeValue.timeValueMillis(1), | ||
| Setting.Property.NodeScope |
There was a problem hiding this comment.
Shall we also make this Setting.Property.Dynamic and add a setting consumer - a setter?
I think we'll want all our health settings Dynamic( but for 8.4 might be tricky to change as they're already in the cluster state). We can discuss this separately though.
gmarouli
left a comment
There was a problem hiding this comment.
LGTM :) Almost there! 🚀
This PR updates HealthService to fetch HealthInfo data using FetchHealthInfoCacheAction (#89820), and to pass it to the HealthIndicatorServices' calculate method. This will allow HealthIndicatorServices to use any data from the health node's HealthInfoCache.