Skip to content

Fetch health info action#89820

Merged
elasticsearchmachine merged 15 commits intoelastic:mainfrom
masseyke:feature/health-api-fetch-health-info-2
Sep 8, 2022
Merged

Fetch health info action#89820
elasticsearchmachine merged 15 commits intoelastic:mainfrom
masseyke:feature/health-api-fetch-health-info-2

Conversation

@masseyke
Copy link
Copy Markdown
Member

@masseyke masseyke commented Sep 6, 2022

This PR adds FetchHealthInfoCacheAction, which fetches the data sent to the health node by the UpdateHealthInfoCacheAction (#89275).

Relates to #84811

@masseyke masseyke added >enhancement :Distributed/Health Issues for the health report API v8.5.0 labels Sep 6, 2022
@elasticsearchmachine elasticsearchmachine added the Team:Data Management (obsolete) DO NOT USE. This team no longer exists. label Sep 6, 2022
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @masseyke, I've created a changelog YAML for you.

Copy link
Copy Markdown
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks Keith

Can you please link this PR (and future PRs) to the disk indicator meta issue?

I'd like to have an IT for the fetching bit too
(similar to UpdateHealthInfoCacheIT).
What do you think? // @gmarouli

Copy link
Copy Markdown
Contributor

@gmarouli gmarouli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for working on this @masseyke . Let me know what you think about the last comments, we are almost ready to ship this too!

@masseyke
Copy link
Copy Markdown
Member Author

masseyke commented Sep 7, 2022

I'd like to have an IT for the fetching bit too (similar to UpdateHealthInfoCacheIT). What do you think? // @gmarouli

What would the integration test test in this case? I thought about adding one but decided not to because the only thing beyond what FetchHealthInfoCacheActionTests is testing that be testing is TransportHealthNodeAction functionality. And that ought to have an integration test of its own (I'm not sure if it does).

@andreidan
Copy link
Copy Markdown
Contributor

andreidan commented Sep 7, 2022

What would the integration test test in this case?

If HealthInfoCache had a bug that would, say, return an empty Map in certain conditions (irrespective of live values being sent to it) we wouldn't catch it with a unit test that stubs a return value.
I think integration tests are useful to make sure the end2end flow is working

@masseyke
Copy link
Copy Markdown
Member Author

masseyke commented Sep 7, 2022

If HealthInfoCache had a bug that would, say, return an empty Map in certain conditions (irrespective of live values being sent to it) we wouldn't catch it with a unit test that stubs a return value. I think integration tests are useful to make sure the end2end flow is working

Wouldn't that be really easy to test in a unit test? And really hard to test in an integration test (because I'm guessing that HealthInfoCache doesn't have such a bug and if it did we'd just fix it)? I'm just trying to get at what would be useful and meaningful to test in an integration test that might protect us from future bugs in this code.

@gmarouli gmarouli mentioned this pull request Sep 7, 2022
9 tasks
@andreidan
Copy link
Copy Markdown
Contributor

@masseyke I think the workflow for health reporting is rather complex (many moving parts in terms of reporting, caching, and fetching) and would benefit for as much testing as possible.

@masseyke
Copy link
Copy Markdown
Member Author

masseyke commented Sep 7, 2022

@elasticsearchmachine run elasticsearch-ci/part-1

@masseyke
Copy link
Copy Markdown
Member Author

masseyke commented Sep 7, 2022

@elasticmachine run elasticsearch-ci/part-1

@masseyke
Copy link
Copy Markdown
Member Author

masseyke commented Sep 7, 2022

@elasticmachine run elasticsearch-ci/bwc

@gmarouli
Copy link
Copy Markdown
Contributor

gmarouli commented Sep 8, 2022

I agree with @andreidan , maybe an integration test doesn't have to simulate a weird edge case but the edge case might present itself. What we are trying to achieve with such an integration test is to test a situation as close to reality as possible and effectively test our assumptions. @masseyke what do you think? Does this make sense?

@masseyke
Copy link
Copy Markdown
Member Author

masseyke commented Sep 8, 2022

I strongly disagree with both of you, but I added several integration tests yesterday.

@masseyke masseyke requested a review from gmarouli September 8, 2022 12:33
@gmarouli
Copy link
Copy Markdown
Contributor

gmarouli commented Sep 8, 2022

That doesn't feel very good to me, you are leading this PR I wouldn't want you to do something you strongly disagree :). I would like to understand better your objection. Do you think that there is no value to have an integration test for this as a sanity check, or that you were planning to add it in the next PR that includes the disk indicator where it would test a more end to end scenario?

@masseyke
Copy link
Copy Markdown
Member Author

masseyke commented Sep 8, 2022

The reason is that there is nothing to really integration test here that's not already covered elsewhere. It doesn't tell us anything new or mitigate any risk. There's no complex interaction in the code in this PR -- unlike the code that runs automatically on all nodes to generate and upload the health data, this just runs on a single node when someone calls it (and doesn't even transform the data in any way). Assuming TransportHealthNodeAction has been thoroughly tested and the unit test for FetchHealthInfoCacheAction is decent, I haven't figured out how to mitigate any additional risk with an integration test (other than maybe that HealthInfoCache can't be injected).
Each test we write takes time and money to run (both in our CI system and for the hundreds of people building the code on their own), and integration tests are especially expensive. If they mitigate any risk at all, they're probably worth the cost. I just haven't been able to understand what risk we're targeting here. In order to minimize the cost, I've piggybacked the FetchHealthInfoCacheAction tests onto UpdateHealthInfoCacheIT (since they're complimentary actions and we'd probably want to test the same situations for both without copy/pasting lots of code).

@@ -121,12 +135,14 @@ public void testMasterFailure() throws Exception {
assertThat(newHealthNode, notNullValue());
assertBusy(() -> {
Map<String, DiskHealthInfo> healthInfoCache = internalCluster.getInstance(HealthInfoCache.class, newHealthNode.getName())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should replace this check all together, and compare the info we get from the action you introduced. What do you think?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense (I just pushed that).

@masseyke masseyke requested a review from gmarouli September 8, 2022 13:47
@masseyke
Copy link
Copy Markdown
Member Author

masseyke commented Sep 8, 2022

@elasticmachine update branch

@gmarouli
Copy link
Copy Markdown
Contributor

gmarouli commented Sep 8, 2022

Thank you for sharing your view on the matter. I like your proposal to piggy back on the UpdateHealthInfoCacheIT and I would suggest to replace the class retrieval by the FetchHealthInfoCacheAction. Code wise it's much nicer to use this action than try to find the correct node to get the instance from a class. Is this approach something that you feel comfortable with? We reduce the costs and we use the action to retrieve the data for the assertions.

@masseyke
Copy link
Copy Markdown
Member Author

masseyke commented Sep 8, 2022

@elasticmachine run elasticsearch-ci/part-1

Copy link
Copy Markdown
Contributor

@gmarouli gmarouli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a few minor comments, but it looks good thanks for carrying it through. I hope you are happy with the result too.

* @param notExpectedNodeId A nullable nodeId that we do _not_ expect to see in the results
* @throws Exception
*/
private void assertResultsCanBeFetched(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To simplify this what if we pick a node randomly? Isn't this good enough?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what it's doing right? It picks a random non-health node, and then also the health node (to cover the 2 main cases).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it does both, I was wondering since as you said the routing is tested in the TransportHealthNodeAction, that maybe it's sufficient to just pick a node and fetch the data once. Without differentiating between the health and other nodes. WDYT?

…asseyke/elasticsearch into feature/health-api-fetch-health-info-2
@masseyke masseyke added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 8, 2022
@masseyke
Copy link
Copy Markdown
Member Author

masseyke commented Sep 8, 2022

@elasticmachine update branch

@elasticsearchmachine elasticsearchmachine merged commit 691c08a into elastic:main Sep 8, 2022
@masseyke masseyke deleted the feature/health-api-fetch-health-info-2 branch September 8, 2022 18:06
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Sep 9, 2022
* main: (34 commits)
  Make sure ivy repo directory exists before downloading artifacts
  Use 'file://' scheme for local repository URL
  Use DRA artifacts for release build CI jobs
  Log unsuccessful attempts to get credentials from web identity tokens (elastic#88241)
  Script: Write Field API path manipulation (elastic#89889)
  Fetch health info action (elastic#89820)
  Fix memory leak in TransportDeleteExpiredDataAction (elastic#89935)
  [ML] Performance improvements for categorization jobs (elastic#89824)
  [DOCS] Revert changes for ES_JAVA_OPTS (elastic#89931)
  Fix deadlock bug exposed by a test (elastic#89934)
  [Downsampling] Remove `FieldValueFetcher` validator (elastic#89497)
  Fix segment stats in tsdb (elastic#89754)
  Synthetic _source: support dense_vector (elastic#89840)
  REST tests fetching fields with synthetic _source (elastic#89888)
  Do not deserialize back BytesTransportRequest to clone a request in MockTransportService (elastic#89926)
  Add SDK request logging to debug failures of S3BlobStoreRepositoryTests#testRequestStats (elastic#89912)
  Fix SnapshotStatusApisIT.testGetSnapshotsWithSnapshotInProgress (elastic#89925)
  Document synthetic source for text and keyword (elastic#89893)
  Fix CloneSnapshotIT.testRemoveFailedCloneFromCSWithQueuedSnapshotInProgress (elastic#89914)
  Add missing index.mapping.total_fields.limit setting to the target index (elastic#89875)
  ...
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Sep 9, 2022
* main: (176 commits)
  Fix RandomSamplerAggregatorTests testAggregationSamplingNestedAggsScaled test failure (elastic#89958)
  [Downsampling] Replace document map with SMILE encoded doc (elastic#89495)
  Remove full cluster state from error logging in MasterService (elastic#89960)
  [ML] Truncate categorization fields (elastic#89827)
  [TSDB] Removed `summary` and `histogram` metric types (elastic#89937)
  Update testNodeSelectorRouting so that it does not depend on iteration order (elastic#89879)
  Make sure listener is resolved when file queue is cleared (elastic#89929)
  [Stable plugin api] Extensible annotation (elastic#89903)
  Fix double sending of response in TransportOpenIdConnectPrepareAuthenticationAction (elastic#89930)
  Make sure ivy repo directory exists before downloading artifacts
  Use 'file://' scheme for local repository URL
  Use DRA artifacts for release build CI jobs
  Log unsuccessful attempts to get credentials from web identity tokens (elastic#88241)
  Script: Write Field API path manipulation (elastic#89889)
  Fetch health info action (elastic#89820)
  Fix memory leak in TransportDeleteExpiredDataAction (elastic#89935)
  [ML] Performance improvements for categorization jobs (elastic#89824)
  [DOCS] Revert changes for ES_JAVA_OPTS (elastic#89931)
  Fix deadlock bug exposed by a test (elastic#89934)
  [Downsampling] Remove `FieldValueFetcher` validator (elastic#89497)
  ...
elasticsearchmachine pushed a commit that referenced this pull request Sep 14, 2022
This PR updates HealthService to fetch HealthInfo data using
FetchHealthInfoCacheAction (#89820), and to pass it to the
HealthIndicatorServices' calculate method. This will allow
HealthIndicatorServices to use any data from the health node's
HealthInfoCache.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed/Health Issues for the health report API >non-issue Team:Data Management (obsolete) DO NOT USE. This team no longer exists. v8.5.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants