Skip to content

Persistent health task#86131

Merged
gmarouli merged 57 commits intoelastic:masterfrom
gmarouli:persistent-health-task
Jun 16, 2022
Merged

Persistent health task#86131
gmarouli merged 57 commits intoelastic:masterfrom
gmarouli:persistent-health-task

Conversation

@gmarouli
Copy link
Copy Markdown
Contributor

@gmarouli gmarouli commented Apr 25, 2022

The health API, has the requirement to be as quick and as cheap as possible. To achieve this we want to have one node responsible to keep the health state of all the node of the cluster without querying all the nodes in the spot. We do not want to burden a dedicating master with this, so we are going to use a node that can contain data for this responsibility.

Health role
We discussed introducing a health role, but it doesn't seem necessary since the only requirement is that the node responsible for collecting the health states is that it contains data.

The selection of the health node happens via the persistent task framework. We introduce a new persistent task and whichever node is assigned to it, is the selected health node. If the selected health node is marked for shutdown via the Shutdown API we release the task so the master can assign it to a new node.

This PR is introducing the persistent task behind a feature flag.

Relates to #84811

@gmarouli gmarouli added >non-issue :Distributed/Health Issues for the health report API labels Apr 25, 2022
gmarouli added a commit that referenced this pull request May 4, 2022
The tests affected in this PR are expecting that the tasks related to the test are the only tasks running. This assumption will not hold soon. For this reason, we refactor the tests so that the assertions are handling only the tasks that are related to the test.

There is only one exception BaseMlIntegTestCase, in this case it was easier to exclude unrelated tasks than to include all the possible tasks created as a side effect of the ML jobs. The constant UNRELATED_TASKS enabled developers to add tasks that need to be excluded, for example the new task we are introducing in #86131.
Co-authored-by: David Turner <david.turner@elastic.co>
@gmarouli
Copy link
Copy Markdown
Contributor Author

gmarouli commented Jun 8, 2022

@elasticmachine update branch

@gmarouli
Copy link
Copy Markdown
Contributor Author

@elasticmachine update branch

@gmarouli gmarouli requested review from andreidan and dakrone June 13, 2022 09:20
@gmarouli
Copy link
Copy Markdown
Contributor Author

Regarding the failing tests that were extending ESSingleNodeTestCase. We confirmed that both tests could not tolerate any changes on the cluster state during the test:

  • The ClusterStateApiTests because it was requesting the next version of the cluster state which hasn't happened yet, expecting to time out. If the health task was assigned during this time, then it would not timeout.
  • The IndicesServiceTests because it was set-up as a unit test to test IndicesService. It would simulate changes in the IndicesService but it wouldn't propagate the changes to the cluster state. If a cluster state change would happen during the test, then the indices in IndicesService did not exist yet in the cluster state which would fail the respective assertion.

After discussion with the team we discussed that tests that extend ESSingleNodeTestCase and use it as a unit test because mocking the environment would be too hard, are allowed to disable the health node to not add complexity.

Copy link
Copy Markdown
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this Mary! I think this looks great
Left a few minor suggestions but this is generally great

if (event.state().nodes().getLocalNode().isMasterNode() == false || HealthNodeSelector.findTask(event.state()) != null) {
clusterService.removeListener(taskStarter);
} else if (event.localNodeMaster()) {
clusterService.removeListener(taskStarter);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems we're removing the task starter on both branches of this if/else clause. Could this be simplified?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave it a shot, let me know if it's clear as well as simplified.


import static org.elasticsearch.health.node.selection.HealthNodeSelector.TASK_NAME;

class HealthNodeSelectorTaskParams implements PersistentTaskParams {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the test can override assertEqualInstances to make it match our new possible invariant? (ie. same instance?)

Copy link
Copy Markdown
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally looks good to me! I left a couple of comments, but I'll defer the final review to Andrei et al.

@gmarouli
Copy link
Copy Markdown
Contributor Author

@elasticmachine update branch

@gmarouli gmarouli requested a review from andreidan June 15, 2022 11:20
Copy link
Copy Markdown
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for iterating on this Mary

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Health Issues for the health report API >non-issue Team:Data Management (obsolete) DO NOT USE. This team no longer exists. v8.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants