Adding DiskHealthIndicatorService by masseyke · Pull Request #90041 · elastic/elasticsearch

masseyke · 2022-09-13T19:41:13Z

This PR adds a new disk health indicator. The output looks something like this:

        "disk_health": {
            "status": "red",
            "symptom": "2 nodes with roles: [data_hot, master] are out of disk space.",
            "details": {
                "disk": [
                    {
                        "node_id": "YlPqjTs2Qq2zROx8U7CamA",
                        "name": "node-1",
                        "status": "UNKNOWN"
                    },
                    {
                        "node_id": "ssbKOKXHQteYctBB9kGCqg",
                        "name": "node-2",
                        "status": "RED"
                    }
                ]
            },
            "impacts": [
                {
                    "severity": 1,
                    "description": "Cannot insert or update documents in the affected indices.",
                    "impact_areas": [
                        "ingest"
                    ]
                },
                {
                    "severity": 2,
                    "description": "Cluster stability might be impaired.",
                    "impact_areas": [
                        "deployment_management"
                    ]
                }
            ],
            "diagnosis": [
                {
                    "cause": "3 indices reside on nodes that have run out of space and writing has been blocked by the system.",
                    "action": "Enable autoscaling (if applicable), add disk capacity or free up disk space to resolve this. If you have already taken action please wait for the rebalancing to complete.",
                    "affected_resources": [
                        "YlPqjTs2Qq2zROx8U7CamA",
                        "ssbKOKXHQteYctBB9kGCqg"
                    ],
                    "help_url": "https://ela.st/free-disk-space-or-add-capacity-data-nodes"
                },
                {
                    "cause": "Disk is almost full.",
                    "action": "Please add capacity to the current nodes, or replace them with ones with higher capacity.",
                    "affected_resources": [
                        "YlPqjTs2Qq2zROx8U7CamA"
                    ],
                    "help_url": "https://ela.st/free-disk-space-or-add-capacity-master-nodes"
                }
            ]
        },

There are 3 classes of problems in this indicator: (1) an index block / data node problem, (2) a node with master role is filling up, and (3) any other node (besides master or data) is filling up. There will be at most 3 impacts and 3 diagnoses -- one for each of these classes of problems, regardless of whether the status of any node is red or yellow.
For example say we have a 5 node cluster:

Node	Roles	Status	Indices
node1	master, data_hot	GREEN	index1
node2	master, data_hot	GREEN	index2
node3	master, data_hot	YELLOW	index3(*)
node4	data_hot	YELLOW	index3(*)
node5	data_hot	RED	index2

Say we have an index that has a block on it in the cluster state (index3), and that index lives on node3 and node4. Let's say that even though the lock is still in the cluster state node 3 and node4 have fallen below the flood stage so they return YELLOW. And node5 is over the flood limit so it returns RED. Then we'd expect the overall status to be RED (both because of the blocked index and the RED node). We'd expect the symptom to list problems with nodes 3, 4, and 5. And we'd expect two impacts and two diagnoses -- one each for the index / data node problem and the master node problem. The result might look like this:

        "disk_health": {
            "status": "red",
            "symptom": "3 nodes with roles: [data_hot, master] are out of disk space.",
            "details": {
                "disk": [
                    {
                        "node_id": "LQqafXKMQcmAZQU9AG2shQ",
                        "name": "node-1",
                        "status": "GREEN"
                    },
                    {
                        "node_id": "sqkFF0HeRVm00_oDJdhZiA",
                        "name": "node-2",
                        "status": "GREEN"
                    },
                    {
                        "node_id": "YlPqjTs2Qq2zROx8U7CamA",
                        "name": "node-3",
                        "status": "YELLOW"
                    },
                    {
                        "node_id": "ssbKOKXHQteYctBB9kGCqg",
                        "name": "node-4",
                        "status": "YELLOW"
                    },
                    {
                        "node_id": "KvX6n7AGQdOX9ELmeGNqKw",
                        "name": "node-5",
                        "status": "RED"
                    }
                ]
            },
            "impacts": [
                {
                    "severity": 1,
                    "description": "Cannot insert or update documents in the affected indices.",
                    "impact_areas": [
                        "ingest"
                    ]
                },
                {
                    "severity": 2,
                    "description": "Cluster stability might be impaired.",
                    "impact_areas": [
                        "deployment_management"
                    ]
                }
            ],
            "diagnosis": [
                {
                    "cause": "2 indices reside on nodes that have run out of space and writing has been blocked by the system.",
                    "action": "Enable autoscaling (if applicable), add disk capacity or free up disk space to resolve this. If you have already taken action please wait for the rebalancing to complete.",
                    "affected_resources": [
                        "YlPqjTs2Qq2zROx8U7CamA",
                        "ssbKOKXHQteYctBB9kGCqg",
                        "KvX6n7AGQdOX9ELmeGNqKw"
                    ],
                    "help_url": "https://ela.st/free-disk-space-or-add-capacity-data-nodes"
                },
                {
                    "cause": "Disk is almost full.",
                    "action": "Please add capacity to the current nodes, or replace them with ones with higher capacity.",
                    "affected_resources": [
                        "YlPqjTs2Qq2zROx8U7CamA"
                    ],
                    "help_url": "https://ela.st/free-disk-space-or-add-capacity-master-nodes"
                }
            ]
        },

…, and combining index blocks with red indices

andreidan

Thanks Keith, left some suggestions

andreidan · 2022-09-14T12:23:45Z

server/src/main/java/org/elasticsearch/health/node/DiskHealthIndicatorService.java

+                .flatMap(Collection::stream)
+                .collect(Collectors.toSet());
+            if (roles.contains(DiscoveryNodeRole.MASTER_ROLE)) {
+                impacts.addAll(StableMasterHealthIndicatorService.UNSTABLE_MASTER_IMPACTS);


Do we want to report all these? I think one impact that summarises the impact on the master node would be enough? If we decide to have the same impacts (all from the stable master indicator) I think we shouldn't create this coupling between indicators but redefine the impacts in this indicator.

What do you think? (@gmarouli)

I agree, so I would expect that if the disk is actually causing the master to be unstable, then the unstable master will elaborately say the impact. Here we "assume" that a master running out of disk will cause issues, but we do not really know. I would suggest just saying that the master is running out of disk and that could have negative impact on the following areas is enough. What do you think?

What are we going to do if the master and data nodes are fine but the ingest node or ml node are full? We wouldn't report the master node impacts in that case right? But we also wouldn't want to just ignore those would we?

ingest node or ml node are full? We wouldn't report the master node impacts in that case right?

That's correct, we would not return the master node impacts (as the master node is fine w.r.t. disk)

But we also wouldn't want to just ignore those would we?

I assume "those" here are the possible impacts of ingest/ML nodes being full? If so, yes, we should NOT ignore them but report ingest/ML related impacts.
If "those" meant the "master related impacts", I think we should ignore them as the master nodes are fine.

If the master nodes have issues with disk space we should report that the master functions might be impacted
ie. The master nodes' ability to create, delete, or rebalance indices, and insert or update documents might be affected

server/src/main/java/org/elasticsearch/health/node/DiskHealthIndicatorService.java

andreidan · 2022-09-14T12:27:40Z

server/src/main/java/org/elasticsearch/health/node/DiskHealthIndicatorService.java

+            if (roles.contains(DiscoveryNodeRole.TRANSFORM_ROLE)) {
+                impacts.add(HEALTH_TRANSFORM_NODES_IMPACT);
+            }
+            diagnosisList = getDiskProblemsDiagnosis(explain, problemNodes, problemIndices);


I believe the diagnosis will be based on the node role right?

Or is the plan to have one diagnosis and mention the various roles in the troubleshooting guide?

IMO we should have a diagnosis for data nodes (with its own id and troubleshooting guide), one for master nodes, etc

What do you think? (@gmarouli )

Update: Maybe having one troubleshooting guide that contains different instructions/role is a good way forward, but I think the messaging in the diagnosis cause and action should be telling with regards to the affected nodes and roles (similar to how we do it in the shards_availability indicator https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java#L186 )

Also, is it time we add a type to the affected_resource and report the problematic node_ids there?

We currently return different things - policy names in the SLM indicator, indices in shards_availability.

What do you think?

I think this requires some thinking. I am pretty confident that we need a distinction id a node contains data or not, because in the on-prem case adding nodes is not going to fix anything.

We could start with this separation and see how it forms, what do you think?

About the affected resources, I agree I think it's time, for now I would stick with indices and nodes, but I think we should abstract maybe data streams, tiers etc. Any thoughts?

Or is the plan to have one diagnosis and mention the various roles in the troubleshooting guide?

Based on the spreadsheet we've been working from I had been thinking one diagnosis for all roles (since in all cases I think it's "find the node and clean up space or add disk") but different impacts for each role (since that will truly be different). We can make different diagnoses for each role but we'll need to work out what those would be.

I am pretty confident that we need a distinction id a node contains data or not, because in the on-prem case adding nodes is not going to fix anything.

That's the one thing we do have here right? If a node has an index, we give a data-specific diagnosis and impact. If the problematic nodes don't have index data then we don't recommend adding nodes. Or am I misunderstanding?

I pushed that code into a separate branch so that we could still talk about it but not have it in this PR: #90077. I'm not tied to that at all if someone wants to take it in a different direction, but wanted to give an idea of the scope of the change.

Based on the spreadsheet we've been working from I had been thinking one diagnosis for all roles (since in all cases I think it's "find the node and clean up space or add disk")

@masseyke I think the solutions will be different. ie. enable autoscaling for data nodes, add capacity for master nodes

I pushed that code into a separate branch so that we could still talk about it but not have it in this PR: #90077. I'm not tied to that at all if someone wants to take it in a different direction, but wanted to give an idea of the scope of the change.

Thanks Keith. I think we'll want something along the lines of

“affected_resources”: { { “type”: “index”, “resources” : [ “index1”, “index2”, …] }, { “type”: “node”, “resources” : [ “node-1”, “node-2, …] }, }

or

“affected_resources”: { “type”: “index”, “resources” : [ “index1”, “index2”, …] , }

But let's have that discussion in a different medium

For the scope of this PR shall we report the node ids in the affected_resources ?

For the scope of this PR shall we report the node ids in the affected_resources ?

Yeah that's what I'm proposing.

@masseyke I think the solutions will be different. ie. enable autoscaling for data nodes, add capacity for master nodes

And that's in line with what we have here right? Different diagnoses for data nodes vs other nodes. I think I'm missing something.

server/src/test/java/org/elasticsearch/health/node/DiskHealthIndicatorServiceTests.java

gmarouli · 2022-09-14T19:13:56Z

server/src/main/java/org/elasticsearch/health/node/DiskHealthIndicatorService.java

+                .flatMap(Collection::stream)
+                .collect(Collectors.toSet());
+            if (roles.contains(DiscoveryNodeRole.MASTER_ROLE)) {
+                impacts.addAll(StableMasterHealthIndicatorService.UNSTABLE_MASTER_IMPACTS);


I agree, so I would expect that if the disk is actually causing the master to be unstable, then the unstable master will elaborately say the impact. Here we "assume" that a master running out of disk will cause issues, but we do not really know. I would suggest just saying that the master is running out of disk and that could have negative impact on the following areas is enough. What do you think?

server/src/main/java/org/elasticsearch/health/node/DiskHealthIndicatorService.java

gmarouli · 2022-09-14T19:41:16Z

server/src/main/java/org/elasticsearch/health/node/DiskHealthIndicatorService.java

+            if (roles.contains(DiscoveryNodeRole.TRANSFORM_ROLE)) {
+                impacts.add(HEALTH_TRANSFORM_NODES_IMPACT);
+            }
+            diagnosisList = getDiskProblemsDiagnosis(explain, problemNodes, problemIndices);


I think this requires some thinking. I am pretty confident that we need a distinction id a node contains data or not, because in the on-prem case adding nodes is not going to fix anything.

We could start with this separation and see how it forms, what do you think?

About the affected resources, I agree I think it's time, for now I would stick with indices and nodes, but I think we should abstract maybe data streams, tiers etc. Any thoughts?

server/src/main/java/org/elasticsearch/health/node/DiskHealthIndicatorService.java

gmarouli · 2022-09-19T09:19:35Z

server/src/main/java/org/elasticsearch/health/node/DiskHealthIndicatorService.java

+                        healthStatus
+                    );
+                } else {
+                    healthIndicatorResult = getResultForNonDataNodeProblem(rolesOnRedNodes, nodesReportingRed, details, healthStatus);


I think if we have non data nodes having issues we should still report it because then we need to add the diagnosis to add capacity. Right?

Oh I thought what we had agreed on with @andreidan was to have a hierarchy:
If data node problems, show only data node problems
Else if master node problems, show only master node problems
Else show whatever other nodes have problems.

Maybe I am wrong, I thought as a symptom yes but we would add it in the impact and diagnosis. But I am not sure.

@masseyke we should definitely return all impacts and diagnosis. The grouping is to help with which nodes are reported as part of which impact and diagnosis. To reiterate:

if it's a data node - report data node diagnosis (autoscaling)

if it's a dedicated master node - report the master node diagnosis

if it's a dedicated other node (ML, transform) - report (generic) diagnosis for this kind of node

When it comes to impacts, all affected node roles should be reflected in the impact

If we have nodes in all categories, report them all, each in its corresponding category

We talked offline. We're doing a modification of @andreidan 's comment above -- if a node has any role we'll return the impact and diagnosis for that role.

Disk indicator review

elasticsearchmachine · 2022-09-19T17:42:20Z

Pinging @elastic/es-data-management (Team:Data Management)

gmarouli · 2022-09-19T19:32:16Z

I adjusted the details here: masseyke#8

gmarouli · 2022-09-19T19:44:18Z

server/src/main/java/org/elasticsearch/health/node/DiskHealthIndicatorService.java

+    }
+
+    @Override
+    public HealthIndicatorResult calculate(boolean explain, HealthInfo healthInfo) {


I find this method difficult to read and reason about.

I am okay if we keep it like this for now, but I would like to go over it one more time and either restructure it a bit or add comments. I think it will be really difficult for someone that hasn't worked on this as extensively as we to maintain it in the future.

If you agree I will create a github issue. Let me know, if it's just me.

I find it difficult to reason about it too.

Hopefully, if we update the behaviour the design will crystalise too ( #90041 (comment) ).

I think you both found it difficult to follow because we had a miscommunication about the expected behavior. I had thought @andreidan only wanted information about the highest-priority problem (I thought we'd gone through that logic offline). I've changed it now to return information about all problems on all node types (data, master, or other) at all levels (red or yellow).

andreidan

Thanks for working on this Keith.

I left a few minor comments but generally, I think we're not reporting some node types in certain conditions. Hopefully this clarifies things a bit.

Also, can you update the description of this PR with a brief up-to-date definition of the indicator? (ie. the conditions it's reporting on)

server/src/main/java/org/elasticsearch/health/node/DiskHealthIndicatorService.java

…oses

andreidan

LGTM (assuming CI is happy) with a few mentions of subsequent work

Thanks Keith

andreidan · 2022-09-20T20:07:36Z

server/src/main/java/org/elasticsearch/health/node/DiskHealthIndicatorService.java

+            impacts.add(
+                new HealthIndicatorImpact(
+                    1,
+                    "At risk of not being able to insert or update documents in the affected indices.",


I think this sentence is a bit confusing.

It is missing a subject. What is at risk?

Let's update this in a subsequent PR (create an issue for this)

I will revisit this as part of #90212

andreidan · 2022-09-20T20:08:36Z

server/src/main/java/org/elasticsearch/health/node/DiskHealthIndicatorService.java

+        }
+        if (redNonDataNonMasterNodes.isEmpty() == false || yellowNonDataNonMasterNodes.isEmpty() == false) {
+            impacts.add(
+                new HealthIndicatorImpact(2, "Some cluster functionality might be unavailable.", List.of(ImpactArea.DEPLOYMENT_MANAGEMENT))


I think the node roles that are affected should define the functionality that might be impaired (ie. ML, transform, ingest pipelines).

Let's fix this in a subsequent PR (create an issue for this)

Created: #90266

andreidan · 2022-09-20T20:10:23Z

server/src/main/java/org/elasticsearch/health/node/DiskHealthIndicatorService.java

+        Set<String> indicesWithBlock,
+        Set<String> indicesOnRedNodes,
+        Set<String> indicesOnYellowNodes,
+        Set<String> nodesWithBlockedIndices,
+        Set<String> redDataNodes,
+        Set<String> yellowDataNodes,
+        Set<String> redMasterNodes,
+        Set<String> yellowMasterNodes,
+        Set<String> redNonDataNonMasterNodes,
+        Set<String> yellowNonDataNonMasterNodes


I think we should group these a in a couple of records (similar to what we do in https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java#L349 )

Let's create an issue and refactor in a subsequent PR.

Issue created: #90212

Adding disk health indicator

f6dec1d

masseyke added >non-issue :Distributed/Health Issues for the health report API v8.5.0 labels Sep 13, 2022

masseyke added 2 commits September 13, 2022 17:41

adding impacts, removing UNKNOWN status when we dont have health info…

cbc6a52

…, and combining index blocks with red indices

avoiding NPEs

f536b74

andreidan reviewed Sep 14, 2022

View reviewed changes

masseyke added 3 commits September 14, 2022 10:19

code review feedback

fab5572

Starting docs

b16ffff

Merge branch 'main' into feature/disk-health-indicator

291cdc7

masseyke mentioned this pull request Sep 14, 2022

Disk Usage health indicator #84811

Closed

9 tasks

masseyke added 2 commits September 14, 2022 14:25

Adding node name to disk results

9a247c2

Merge branch 'main' into feature/disk-health-indicator

079bc4d

gmarouli reviewed Sep 14, 2022

View reviewed changes

masseyke added 6 commits September 14, 2022 15:01

merging main

9bdcce5

Merge branch 'main' into feature/disk-health-indicator

2941aa5

code review feedback

5cc47f8

fixing integration test

2a150a7

updating with new symptoms, diagnoses, and impacts (work in progress)

e6745e8

more testing

580a02d

gmarouli reviewed Sep 19, 2022

View reviewed changes

gmarouli mentioned this pull request Sep 19, 2022

Enable the health node and the disk health indicator #84811 #90085

Merged

gmarouli and others added 4 commits September 19, 2022 15:22

Revert UNSTABLE_MASTER_IMPACTS back to private

4c1634d

Use one cluster state during indicator calculation

e96ffa7

Adding more tests

c0c284c

Merge pull request #7 from gmarouli/disk-indicator-review

b31fa7e

Disk indicator review

masseyke marked this pull request as ready for review September 19, 2022 17:41

masseyke requested review from andreidan and gmarouli September 19, 2022 17:42

elasticsearchmachine added the Team:Data Management (obsolete) DO NOT USE. This team no longer exists. label Sep 19, 2022

gmarouli reviewed Sep 19, 2022

View reviewed changes

masseyke added 3 commits September 19, 2022 16:39

Adding comments

85e9665

slight reorganization

1a900c3

adding a unit test

a1d83c9

masseyke requested a review from gmarouli September 19, 2022 22:46

andreidan reviewed Sep 20, 2022

View reviewed changes

masseyke added 6 commits September 20, 2022 07:54

code review feedback

a66e908

code review feedback

bd64e6b

Heavily modified symptom; also now support multiple impacts and diagn…

b750fb1

…oses

more unit tests

cff3316

Merge branch 'main' into feature/disk-health-indicator

dd45f9d

comments

60710ef

masseyke requested a review from andreidan September 20, 2022 19:46

andreidan approved these changes Sep 20, 2022

View reviewed changes

masseyke merged commit 2566cd1 into elastic:main Sep 20, 2022

Conversation

masseyke commented Sep 13, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andreidan left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

andreidan Sep 15, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

andreidan Sep 14, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

andreidan Sep 20, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

elasticsearchmachine commented Sep 19, 2022

Uh oh!

gmarouli commented Sep 19, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

andreidan left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

masseyke commented Sep 13, 2022 •

edited

Loading

andreidan Sep 15, 2022 •

edited

Loading

andreidan Sep 14, 2022 •

edited

Loading

andreidan Sep 20, 2022 •

edited

Loading

andreidan left a comment •

edited

Loading

andreidan left a comment •

edited

Loading