Add disk thresholds in the cluster state #88175

Merged
gmarouli merged 26 commits into elastic:master from gmarouli:health-disk-metadata-in-cluster-state
Jul 7, 2022
@gmarouli
Contributor

@gmarouli gmarouli commented Jun 29, 2022

Problem statement
For a data node, we use the watermarks to determine whether the node's disk usage is healthy. The watermarks can be configured in different ways, and it's possible that each node has a different watermark configuration. This is not desirable: we want to use the same thresholds for all nodes, specifically the ones that the master is using.

Proposal
When a node is elected as master, it will add custom metadata to the cluster state that describes these thresholds. For example:

    "health": {
      "disk": {
        "low_watermark": "85%",
        "high_watermark": "90%",
        "flood_stage_watermark": "95%",
        "frozen_flood_stage_watermark": "95%",
        "frozen_flood_stage_max_headroom": "20gb"
      }
    }

In this PR, we introduce the health metadata and we wire the existing disk thresholds to update the health metadata in the cluster state.
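As a rough illustration of the proposal, the sketch below models the cluster state as an immutable map of custom entries and has the newly elected master publish its own disk thresholds under a `health` key. All names here (`HealthMetadataSketch`, `DiskThresholds`, `ClusterStateSketch`, `onElectedAsMaster`) are hypothetical and are not the classes this PR introduces; the fields simply mirror the example above.

```java
import java.util.HashMap;
import java.util.Map;

public class HealthMetadataSketch {

    /** Hypothetical value object; fields mirror the example in the proposal. */
    record DiskThresholds(
        String lowWatermark,
        String highWatermark,
        String floodStageWatermark,
        String frozenFloodStageWatermark,
        String frozenFloodStageMaxHeadroom
    ) {}

    /** Hypothetical stand-in for a cluster state: an immutable map of custom metadata. */
    record ClusterStateSketch(Map<String, Object> customs) {
        ClusterStateSketch {
            customs = Map.copyOf(customs); // defensive, unmodifiable copy
        }

        /** Returns a new state with the given custom entry added or replaced. */
        ClusterStateSketch withCustom(String key, Object value) {
            Map<String, Object> updated = new HashMap<>(customs);
            updated.put(key, value);
            return new ClusterStateSketch(updated);
        }
    }

    /**
     * Called on the node that just became master (an assumption of this sketch):
     * publish the master's local thresholds unless they are already in the state.
     */
    static ClusterStateSketch onElectedAsMaster(ClusterStateSketch current, DiskThresholds localThresholds) {
        if (localThresholds.equals(current.customs().get("health"))) {
            return current; // nothing changed, avoid a needless state update
        }
        return current.withCustom("health", localThresholds);
    }

    public static void main(String[] args) {
        DiskThresholds masterSettings = new DiskThresholds("85%", "90%", "95%", "95%", "20gb");
        ClusterStateSketch state = new ClusterStateSketch(Map.of());
        state = onElectedAsMaster(state, masterSettings);
        System.out.println(state.customs().get("health"));
    }
}
```

Returning the same state object when nothing changed is a design choice of the sketch: it keeps a re-election with identical settings from producing a new state.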

Part of #84811

@gmarouli
Contributor Author

@elasticmachine run elasticsearch-ci/part-2

@gmarouli gmarouli marked this pull request as ready for review June 29, 2022 13:15
@gmarouli gmarouli requested review from andreidan and dakrone June 29, 2022 13:15
@gmarouli
Contributor Author

@dakrone referring to #87975 (review)

I think there is a misunderstanding. The health metadata is not supposed to be node-specific; we intend it to be the same for all the nodes of the cluster and determined by the master node.

Effectively, that's what is happening right now when it comes to allocation too. All nodes have disk threshold settings (potentially different ones), but the elected master node uses its own in the allocation decider. In a similar way, we want the master node to propagate the thresholds so every node can check its disk usage and report back using the same thresholds.

Is this clearer now?
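To make the idea concrete, here is a minimal sketch of a node evaluating its own disk usage against thresholds it received from the master rather than against its local settings. The class, the `Status` values, and the mapping from thresholds to statuses are assumptions made for illustration only; they are not taken from this PR or from the health API.

```java
public class DiskHealthCheckSketch {

    /** Hypothetical status values for this sketch. */
    enum Status { GREEN, YELLOW, RED }

    /** Hypothetical thresholds as published by the master, expressed as used-disk percentages. */
    record Thresholds(double highWatermarkPercent, double floodStagePercent) {}

    /**
     * Evaluates local disk usage against the master-provided thresholds.
     * The status mapping (high watermark -> YELLOW, flood stage -> RED) is an assumption.
     */
    static Status evaluate(long totalBytes, long freeBytes, Thresholds fromMaster) {
        if (totalBytes <= 0 || freeBytes < 0 || freeBytes > totalBytes) {
            throw new IllegalArgumentException("invalid disk sizes");
        }
        double usedPercent = 100.0 * (totalBytes - freeBytes) / totalBytes;
        if (usedPercent >= fromMaster.floodStagePercent()) {
            return Status.RED;
        }
        if (usedPercent >= fromMaster.highWatermarkPercent()) {
            return Status.YELLOW;
        }
        return Status.GREEN;
    }

    public static void main(String[] args) {
        // Every node uses the same thresholds, regardless of its local settings.
        Thresholds fromMaster = new Thresholds(90.0, 95.0);
        System.out.println(evaluate(1000, 200, fromMaster)); // 80% used, prints GREEN
        System.out.println(evaluate(1000, 80, fromMaster));  // 92% used, prints YELLOW
        System.out.println(evaluate(1000, 30, fromMaster));  // 97% used, prints RED
    }
}
```

Because the thresholds are passed in as an argument, two nodes with different local configurations report the same status for the same usage.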

@gmarouli gmarouli mentioned this pull request Jun 29, 2022
9 tasks
@gmarouli gmarouli added >non-issue :Distributed/Health Issues for the health report API labels Jun 29, 2022
@elasticmachine elasticmachine added the Team:Data Management (obsolete) DO NOT USE. This team no longer exists. label Jun 29, 2022
Contributor

@andreidan andreidan left a comment

Thanks for working on this Mary.

I've left a few suggestions and questions

@gmarouli
Contributor Author

@elasticmachine update branch

@gmarouli
Contributor Author

gmarouli commented Jul 4, 2022

@elasticmachine update branch

@gmarouli gmarouli requested a review from andreidan July 4, 2022 08:53
Contributor

@andreidan andreidan left a comment

LGTM, thanks for iterating on this Mary

@gmarouli
Contributor Author

gmarouli commented Jul 7, 2022

@elasticmachine update branch

@gmarouli
Contributor Author

gmarouli commented Jul 7, 2022

@elasticmachine update branch

@gmarouli
Contributor Author

gmarouli commented Jul 7, 2022

Dotting the i's and crossing the t's

  • Removed the low watermark from the HealthMetadata; we do not use it in the disk health calculation, so there is no need to store it there.
  • Reverted the DiskThresholdSettingParser: in the course of this PR it became unnecessary (at least for now) to decouple the settings from their parsing.

@gmarouli
Contributor Author

gmarouli commented Jul 7, 2022

@elasticmachine update branch

@gmarouli gmarouli merged commit 4834965 into elastic:master Jul 7, 2022
@gmarouli gmarouli deleted the health-disk-metadata-in-cluster-state branch July 7, 2022 16:00
Labels

:Distributed/Health Issues for the health report API >non-issue Team:Data Management (obsolete) DO NOT USE. This team no longer exists. v8.4.0

4 participants