Skip to content

Updating HealthService to use FetchHealthInfoCacheAction#89947

Merged
elasticsearchmachine merged 15 commits intoelastic:mainfrom
masseyke:feature/health-api-health-service
Sep 14, 2022
Merged

Updating HealthService to use FetchHealthInfoCacheAction#89947
elasticsearchmachine merged 15 commits intoelastic:mainfrom
masseyke:feature/health-api-health-service

Conversation

@masseyke
Copy link
Copy Markdown
Member

@masseyke masseyke commented Sep 8, 2022

This PR updates HealthService to fetch HealthInfo data using FetchHealthInfoCacheAction (#89820), and to pass it to the HealthIndicatorServices' calculate method. This will allow HealthIndicatorServices to use any data from the health node's HealthInfoCache.

@masseyke masseyke added >non-issue :Distributed/Health Issues for the health report API v8.5.0 labels Sep 8, 2022
@masseyke masseyke mentioned this pull request Sep 8, 2022
9 tasks
@masseyke masseyke marked this pull request as ready for review September 9, 2022 13:12
@elasticsearchmachine elasticsearchmachine added the Team:Data Management (obsolete) DO NOT USE. This team no longer exists. label Sep 9, 2022
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@andreidan
Copy link
Copy Markdown
Contributor

Relates to #84811

Copy link
Copy Markdown
Contributor

@gmarouli gmarouli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow the last part of the disk indicator!! How exciting, thank you for putting this together @masseyke . I did quick pass and I had two comments that might have quite some impact so after we make some progress on those I will do a more detailed pass. Let me know if I can help :).

FetchHealthInfoCacheAction.INSTANCE,
new FetchHealthInfoCacheAction.Request()
);
FetchHealthInfoCacheAction.Response response = responseActionFuture.actionGet();
Copy link
Copy Markdown
Contributor

@gmarouli gmarouli Sep 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rework this to not use a blocking call? I think it should be possible. Let me know if I can help.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want this to be a blocking call right? What do you have in mind?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I misunderstood you. I've changed the code so that it no longer blocks the thread (but still blocks on responding to the user).

Copy link
Copy Markdown
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this Keith

Left a comment

FetchHealthInfoCacheAction.INSTANCE,
new FetchHealthInfoCacheAction.Request()
);
FetchHealthInfoCacheAction.Response response = responseActionFuture.actionGet();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things here:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the first bullet point, this PR will address this: #90003.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made the change for the 2nd. The timeout is set (well currently not set) in TransportHealthNodeAction. Does 10s sound reasonable for all health node actions? Or do we need to make this configurable differently for different actions? I would think 10s would be fine because all of this data is fairly time-sensitive and all of the actions ought to be very fast.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These transport actions should be very quick. Let's go with a setting that defaults to 5s

@masseyke
Copy link
Copy Markdown
Member Author

@elasticmachine run elasticsearch-ci/part-3

Copy link
Copy Markdown
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Keith, left a few comments

Comment on lines +89 to +103
ClusterState state = internalCluster().client()
.admin()
.cluster()
.prepareState()
.clear()
.setMetadata(true)
.setNodes(true)
.get()
.getState();
DiscoveryNode healthNode = HealthNode.findHealthNode(state);
assertNotNull(healthNode);
Map<String, DiskHealthInfo> healthInfoCache = internalCluster().getInstance(HealthInfoCache.class, healthNode.getName())
.getHealthInfo()
.diskInfoByNode();
assertThat(healthInfoCache.size(), equalTo(state.getNodes().getNodes().keySet().size()));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a bit of a nit but should we use the API we now have to fetch the health info?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do that (I assume you're talking about the FetchHealthInfoCacheAction).

getHealthNoHealthInfo(listener, indicatorName, preflightResults, filteredIndicators.toList(), explain);
}
});
} catch (NodeNotConnectedException | HealthNodeNotDiscoveredException e) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the async calls don't throw these exceptions right? They're propagated to the listener via onFailure

Do we need special handling there?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that this happens in client.execute before it does the async bit but i'll double-check.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you're right -- the async calls don't throw these. I'll take this block out.

}
});
} catch (NodeNotConnectedException | HealthNodeNotDiscoveredException e) {
logger.info("Could not fetch data from health node", e);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this catch clause is needed I think this log statement should be debug

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed (typically) when the master-is-stable is reporting that there is a stable master node but the master itself is null (since we say the master is stable within the last 30 seconds). I can bump down the level.


@Override
public void onFailure(Exception e) {
getHealthNoHealthInfo(listener, indicatorName, preflightResults, filteredIndicators.toList(), explain);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method reads confusing to me - get health no health?

is this method needed? (it seems to differ from the combineResultsAndNotifyListener method by hardcoding EMPTY_HEALTH_INFO). Also is it intentional that the indicators are calculated iniside the method (it's quite trappy IMO - especially in contrast to combineResultsAndNotifyListener )

IMO this is quite a shallow method and we should drop it.

Duplicating a few method calls (instead of it) is fine.

What do you think?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me. I was unhappy with the method name, too.

@Override
public void onResponse(FetchHealthInfoCacheAction.Response response) {
HealthInfo healthInfo = response.getHealthInfo();
combineResultsAndNotifyListener(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method signature is a bit confusing - combines the results into something else? should it return this combined result?
also, why does it need one indicator name? which one is that? Does this method do too much? (hence the confusing signature)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah on a closer read this method filters the results (based on indicatorName), combines them, and calls the listener.

Can we split it into a few methods ? (execute listeners, transform result(s), notify listener)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See what you think of what I've done. It's hard to pull it all apart because the indicatorName is used for validation, and we need the listener when validation fails. I've separated out the filtering of preflight results by indicator name and the combining of results though.

actionName,
request,
task,
TransportRequestOptions.timeout(TimeValue.timeValueSeconds(5)), // expected to be lightweight and time-sensitive
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we have this timeout in a setting that defaults to 5s? #89947 (comment)

public List<HealthIndicatorResult> getHealth(@Nullable String indicatorName, boolean explain) {
public void getHealth(
Client client,
ActionListener<List<HealthIndicatorResult>> listener,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe for these APIs it's usually that we have the client first (to indicate a remote call), the business logic arguments, and last the listener

e.g. public static void closePointInTime(Client client, String pointInTimeId, ActionListener<Boolean> listener)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change that. Didn't realize we had a convention.

* @param filteredIndicatorResults The results of the non-preflight health indicators
*/
private void combineResultsAndNotifyListener(
ActionListener<List<HealthIndicatorResult>> listener,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe for these APIs it's usually that we have the client first (to indicate a remote call), the business logic arguments, and last the listener

e.g. public static void closePointInTime(Client client, String pointInTimeId, ActionListener<Boolean> listener)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL, makes perfect sense.

"health_node.transport_action_timeout",
TimeValue.timeValueSeconds(5),
TimeValue.timeValueMillis(1),
Setting.Property.NodeScope
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just want to test my understanding, this setting is not dynamically updated that's why we do not have a listener for this, right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not dynamically updatable as I've written it. It didn't seem worth the effort to do that here since this one will probably never even be touched. What do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any general direction about this? I do not feel strongly about it. I am mainly thinking since this will also be called in the background regularly maybe someone would want to change it, but we can leave it as is if you think it's not worth the effort.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's please have it dynamic for ease of use (I think it's especially interesting to have it easily updateable while we are experimental and see how it behaves in deployments).

Something simple (a setter) like here would suffice https://github.com/elastic/elasticsearch/blob/main/modules/ingest-geoip/src/main/java/org/elasticsearch/ingest/geoip/GeoIpDownloaderTaskExecutor.java#L80

Copy link
Copy Markdown
Contributor

@gmarouli gmarouli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice progress, thanks @masseyke , I added some minor comments.

@gmarouli gmarouli self-requested a review September 14, 2022 09:20
Copy link
Copy Markdown
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for iterating on this Keith

"health_node.transport_action_timeout",
TimeValue.timeValueSeconds(5),
TimeValue.timeValueMillis(1),
Setting.Property.NodeScope
Copy link
Copy Markdown
Contributor

@andreidan andreidan Sep 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we also make this Setting.Property.Dynamic and add a setting consumer - a setter?

I think we'll want all our health settings Dynamic( but for 8.4 might be tricky to change as they're already in the cluster state). We can discuss this separately though.

Copy link
Copy Markdown
Contributor

@gmarouli gmarouli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :) Almost there! 🚀

@masseyke masseyke added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 14, 2022
@elasticsearchmachine elasticsearchmachine merged commit 9cbdc2a into elastic:main Sep 14, 2022
@masseyke masseyke deleted the feature/health-api-health-service branch September 14, 2022 19:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed/Health Issues for the health report API >non-issue Team:Data Management (obsolete) DO NOT USE. This team no longer exists. v8.5.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants