Skip to content

Adding DiskHealthIndicatorService#90041

Merged
masseyke merged 27 commits intoelastic:mainfrom
masseyke:feature/disk-health-indicator
Sep 20, 2022
Merged

Adding DiskHealthIndicatorService#90041
masseyke merged 27 commits intoelastic:mainfrom
masseyke:feature/disk-health-indicator

Conversation

@masseyke
Copy link
Copy Markdown
Member

@masseyke masseyke commented Sep 13, 2022

This PR adds a new disk health indicator. The output looks something like this:

        "disk_health": {
            "status": "red",
            "symptom": "2 nodes with roles: [data_hot, master] are out of disk space.",
            "details": {
                "disk": [
                    {
                        "node_id": "YlPqjTs2Qq2zROx8U7CamA",
                        "name": "node-1",
                        "status": "UNKNOWN"
                    },
                    {
                        "node_id": "ssbKOKXHQteYctBB9kGCqg",
                        "name": "node-2",
                        "status": "RED"
                    }
                ]
            },
            "impacts": [
                {
                    "severity": 1,
                    "description": "Cannot insert or update documents in the affected indices.",
                    "impact_areas": [
                        "ingest"
                    ]
                },
                {
                    "severity": 2,
                    "description": "Cluster stability might be impaired.",
                    "impact_areas": [
                        "deployment_management"
                    ]
                }
            ],
            "diagnosis": [
                {
                    "cause": "3 indices reside on nodes that have run out of space and writing has been blocked by the system.",
                    "action": "Enable autoscaling (if applicable), add disk capacity or free up disk space to resolve this. If you have already taken action please wait for the rebalancing to complete.",
                    "affected_resources": [
                        "YlPqjTs2Qq2zROx8U7CamA",
                        "ssbKOKXHQteYctBB9kGCqg"
                    ],
                    "help_url": "https://ela.st/free-disk-space-or-add-capacity-data-nodes"
                },
                {
                    "cause": "Disk is almost full.",
                    "action": "Please add capacity to the current nodes, or replace them with ones with higher capacity.",
                    "affected_resources": [
                        "YlPqjTs2Qq2zROx8U7CamA"
                    ],
                    "help_url": "https://ela.st/free-disk-space-or-add-capacity-master-nodes"
                }
            ]
        },

There are 3 classes of problems in this indicator: (1) an index block / data node problem, (2) a node with master role is filling up, and (3) any other node (besides master or data) is filling up. There will be at most 3 impacts and 3 diagnoses -- one for each of these classes of problems, regardless of whether the status of any node is red or yellow.
For example say we have a 5 node cluster:

Node Roles Status Indices
node1 master, data_hot GREEN index1
node2 master, data_hot GREEN index2
node3 master, data_hot YELLOW index3(*)
node4 data_hot YELLOW index3(*)
node5 data_hot RED index2

Say we have an index that has a block on it in the cluster state (index3), and that index lives on node3 and node4. Let's say that even though the lock is still in the cluster state node 3 and node4 have fallen below the flood stage so they return YELLOW. And node5 is over the flood limit so it returns RED. Then we'd expect the overall status to be RED (both because of the blocked index and the RED node). We'd expect the symptom to list problems with nodes 3, 4, and 5. And we'd expect two impacts and two diagnoses -- one each for the index / data node problem and the master node problem. The result might look like this:

        "disk_health": {
            "status": "red",
            "symptom": "3 nodes with roles: [data_hot, master] are out of disk space.",
            "details": {
                "disk": [
                    {
                        "node_id": "LQqafXKMQcmAZQU9AG2shQ",
                        "name": "node-1",
                        "status": "GREEN"
                    },
                    {
                        "node_id": "sqkFF0HeRVm00_oDJdhZiA",
                        "name": "node-2",
                        "status": "GREEN"
                    },
                    {
                        "node_id": "YlPqjTs2Qq2zROx8U7CamA",
                        "name": "node-3",
                        "status": "YELLOW"
                    },
                    {
                        "node_id": "ssbKOKXHQteYctBB9kGCqg",
                        "name": "node-4",
                        "status": "YELLOW"
                    },
                    {
                        "node_id": "KvX6n7AGQdOX9ELmeGNqKw",
                        "name": "node-5",
                        "status": "RED"
                    }
                ]
            },
            "impacts": [
                {
                    "severity": 1,
                    "description": "Cannot insert or update documents in the affected indices.",
                    "impact_areas": [
                        "ingest"
                    ]
                },
                {
                    "severity": 2,
                    "description": "Cluster stability might be impaired.",
                    "impact_areas": [
                        "deployment_management"
                    ]
                }
            ],
            "diagnosis": [
                {
                    "cause": "2 indices reside on nodes that have run out of space and writing has been blocked by the system.",
                    "action": "Enable autoscaling (if applicable), add disk capacity or free up disk space to resolve this. If you have already taken action please wait for the rebalancing to complete.",
                    "affected_resources": [
                        "YlPqjTs2Qq2zROx8U7CamA",
                        "ssbKOKXHQteYctBB9kGCqg",
                        "KvX6n7AGQdOX9ELmeGNqKw"
                    ],
                    "help_url": "https://ela.st/free-disk-space-or-add-capacity-data-nodes"
                },
                {
                    "cause": "Disk is almost full.",
                    "action": "Please add capacity to the current nodes, or replace them with ones with higher capacity.",
                    "affected_resources": [
                        "YlPqjTs2Qq2zROx8U7CamA"
                    ],
                    "help_url": "https://ela.st/free-disk-space-or-add-capacity-master-nodes"
                }
            ]
        },

@masseyke masseyke added >non-issue :Distributed/Health Issues for the health report API v8.5.0 labels Sep 13, 2022
Copy link
Copy Markdown
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Keith, left some suggestions

.flatMap(Collection::stream)
.collect(Collectors.toSet());
if (roles.contains(DiscoveryNodeRole.MASTER_ROLE)) {
impacts.addAll(StableMasterHealthIndicatorService.UNSTABLE_MASTER_IMPACTS);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to report all these? I think one impact that summarises the impact on the master node would be enough? If we decide to have the same impacts (all from the stable master indicator) I think we shouldn't create this coupling between indicators but redefine the impacts in this indicator.

What do you think? (@gmarouli)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, so I would expect that if the disk is actually causing the master to be unstable, then the unstable master will elaborately say the impact. Here we "assume" that a master running out of disk will cause issues, but we do not really know. I would suggest just saying that the master is running out of disk and that could have negative impact on the following areas is enough. What do you think?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we going to do if the master and data nodes are fine but the ingest node or ml node are full? We wouldn't report the master node impacts in that case right? But we also wouldn't want to just ignore those would we?

Copy link
Copy Markdown
Contributor

@andreidan andreidan Sep 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ingest node or ml node are full? We wouldn't report the master node impacts in that case right?

That's correct, we would not return the master node impacts (as the master node is fine w.r.t. disk)

But we also wouldn't want to just ignore those would we?

I assume "those" here are the possible impacts of ingest/ML nodes being full? If so, yes, we should NOT ignore them but report ingest/ML related impacts.
If "those" meant the "master related impacts", I think we should ignore them as the master nodes are fine.

If the master nodes have issues with disk space we should report that the master functions might be impacted
ie. The master nodes' ability to create, delete, or rebalance indices, and insert or update documents might be affected

if (roles.contains(DiscoveryNodeRole.TRANSFORM_ROLE)) {
impacts.add(HEALTH_TRANSFORM_NODES_IMPACT);
}
diagnosisList = getDiskProblemsDiagnosis(explain, problemNodes, problemIndices);
Copy link
Copy Markdown
Contributor

@andreidan andreidan Sep 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the diagnosis will be based on the node role right?

Or is the plan to have one diagnosis and mention the various roles in the troubleshooting guide?

IMO we should have a diagnosis for data nodes (with its own id and troubleshooting guide), one for master nodes, etc

What do you think? (@gmarouli )

Update: Maybe having one troubleshooting guide that contains different instructions/role is a good way forward, but I think the messaging in the diagnosis cause and action should be telling with regards to the affected nodes and roles (similar to how we do it in the shards_availability indicator https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java#L186 )

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, is it time we add a type to the affected_resource and report the problematic node_ids there?

We currently return different things - policy names in the SLM indicator, indices in shards_availability.

What do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this requires some thinking. I am pretty confident that we need a distinction id a node contains data or not, because in the on-prem case adding nodes is not going to fix anything.

We could start with this separation and see how it forms, what do you think?

About the affected resources, I agree I think it's time, for now I would stick with indices and nodes, but I think we should abstract maybe data streams, tiers etc. Any thoughts?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is the plan to have one diagnosis and mention the various roles in the troubleshooting guide?

Based on the spreadsheet we've been working from I had been thinking one diagnosis for all roles (since in all cases I think it's "find the node and clean up space or add disk") but different impacts for each role (since that will truly be different). We can make different diagnoses for each role but we'll need to work out what those would be.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am pretty confident that we need a distinction id a node contains data or not, because in the on-prem case adding nodes is not going to fix anything.

That's the one thing we do have here right? If a node has an index, we give a data-specific diagnosis and impact. If the problematic nodes don't have index data then we don't recommend adding nodes. Or am I misunderstanding?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed that code into a separate branch so that we could still talk about it but not have it in this PR: #90077. I'm not tied to that at all if someone wants to take it in a different direction, but wanted to give an idea of the scope of the change.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the spreadsheet we've been working from I had been thinking one diagnosis for all roles (since in all cases I think it's "find the node and clean up space or add disk")

@masseyke I think the solutions will be different. ie. enable autoscaling for data nodes, add capacity for master nodes

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed that code into a separate branch so that we could still talk about it but not have it in this PR: #90077. I'm not tied to that at all if someone wants to take it in a different direction, but wanted to give an idea of the scope of the change.

Thanks Keith. I think we'll want something along the lines of

“affected_resources”: { 
  { “type”: “index”, “resources” : [ “index1”, “index2”, …] },
  {  “type”: “node”, “resources” : [ “node-1”, “node-2, …] }, 
}

or

“affected_resources”: { 
  “type”: “index”, “resources” : [ “index1”, “index2”, …] ,
}

But let's have that discussion in a different medium

For the scope of this PR shall we report the node ids in the affected_resources ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the scope of this PR shall we report the node ids in the affected_resources ?

Yeah that's what I'm proposing.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@masseyke I think the solutions will be different. ie. enable autoscaling for data nodes, add capacity for master nodes

And that's in line with what we have here right? Different diagnoses for data nodes vs other nodes. I think I'm missing something.

@masseyke masseyke mentioned this pull request Sep 14, 2022
9 tasks
.flatMap(Collection::stream)
.collect(Collectors.toSet());
if (roles.contains(DiscoveryNodeRole.MASTER_ROLE)) {
impacts.addAll(StableMasterHealthIndicatorService.UNSTABLE_MASTER_IMPACTS);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, so I would expect that if the disk is actually causing the master to be unstable, then the unstable master will elaborately say the impact. Here we "assume" that a master running out of disk will cause issues, but we do not really know. I would suggest just saying that the master is running out of disk and that could have negative impact on the following areas is enough. What do you think?

if (roles.contains(DiscoveryNodeRole.TRANSFORM_ROLE)) {
impacts.add(HEALTH_TRANSFORM_NODES_IMPACT);
}
diagnosisList = getDiskProblemsDiagnosis(explain, problemNodes, problemIndices);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this requires some thinking. I am pretty confident that we need a distinction id a node contains data or not, because in the on-prem case adding nodes is not going to fix anything.

We could start with this separation and see how it forms, what do you think?

About the affected resources, I agree I think it's time, for now I would stick with indices and nodes, but I think we should abstract maybe data streams, tiers etc. Any thoughts?

healthStatus
);
} else {
healthIndicatorResult = getResultForNonDataNodeProblem(rolesOnRedNodes, nodesReportingRed, details, healthStatus);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we have non data nodes having issues we should still report it because then we need to add the diagnosis to add capacity. Right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I thought what we had agreed on with @andreidan was to have a hierarchy:
If data node problems, show only data node problems
Else if master node problems, show only master node problems
Else show whatever other nodes have problems.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I am wrong, I thought as a symptom yes but we would add it in the impact and diagnosis. But I am not sure.

Copy link
Copy Markdown
Contributor

@andreidan andreidan Sep 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@masseyke we should definitely return all impacts and diagnosis. The grouping is to help with which nodes are reported as part of which impact and diagnosis. To reiterate:

  1. if it's a data node - report data node diagnosis (autoscaling)
  2. if it's a dedicated master node - report the master node diagnosis
  3. if it's a dedicated other node (ML, transform) - report (generic) diagnosis for this kind of node

When it comes to impacts, all affected node roles should be reflected in the impact

If we have nodes in all categories, report them all, each in its corresponding category

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked offline. We're doing a modification of @andreidan 's comment above -- if a node has any role we'll return the impact and diagnosis for that role.

@masseyke masseyke marked this pull request as ready for review September 19, 2022 17:41
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine elasticsearchmachine added the Team:Data Management (obsolete) DO NOT USE. This team no longer exists. label Sep 19, 2022
@gmarouli
Copy link
Copy Markdown
Contributor

I adjusted the details here: masseyke#8

}

@Override
public HealthIndicatorResult calculate(boolean explain, HealthInfo healthInfo) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this method difficult to read and reason about.

I am okay if we keep it like this for now, but I would like to go over it one more time and either restructure it a bit or add comments. I think it will be really difficult for someone that hasn't worked on this as extensively as we to maintain it in the future.

If you agree I will create a github issue. Let me know, if it's just me.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it difficult to reason about it too.

Hopefully, if we update the behaviour the design will crystalise too ( #90041 (comment) ).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you both found it difficult to follow because we had a miscommunication about the expected behavior. I had thought @andreidan only wanted information about the highest-priority problem (I thought we'd gone through that logic offline). I've changed it now to return information about all problems on all node types (data, master, or other) at all levels (red or yellow).

@masseyke masseyke requested a review from gmarouli September 19, 2022 22:46
Copy link
Copy Markdown
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this Keith.

I left a few minor comments but generally, I think we're not reporting some node types in certain conditions. Hopefully this clarifies things a bit.

Also, can you update the description of this PR with a brief up-to-date definition of the indicator? (ie. the conditions it's reporting on)

@masseyke masseyke requested a review from andreidan September 20, 2022 19:46
Copy link
Copy Markdown
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (assuming CI is happy) with a few mentions of subsequent work

Thanks Keith

impacts.add(
new HealthIndicatorImpact(
1,
"At risk of not being able to insert or update documents in the affected indices.",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this sentence is a bit confusing.

It is missing a subject. What is at risk?

Let's update this in a subsequent PR (create an issue for this)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will revisit this as part of #90212

}
if (redNonDataNonMasterNodes.isEmpty() == false || yellowNonDataNonMasterNodes.isEmpty() == false) {
impacts.add(
new HealthIndicatorImpact(2, "Some cluster functionality might be unavailable.", List.of(ImpactArea.DEPLOYMENT_MANAGEMENT))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the node roles that are affected should define the functionality that might be impaired (ie. ML, transform, ingest pipelines).

Let's fix this in a subsequent PR (create an issue for this)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created: #90266

Comment on lines +235 to +244
Set<String> indicesWithBlock,
Set<String> indicesOnRedNodes,
Set<String> indicesOnYellowNodes,
Set<String> nodesWithBlockedIndices,
Set<String> redDataNodes,
Set<String> yellowDataNodes,
Set<String> redMasterNodes,
Set<String> yellowMasterNodes,
Set<String> redNonDataNonMasterNodes,
Set<String> yellowNonDataNonMasterNodes
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should group these a in a couple of records (similar to what we do in https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java#L349 )

Let's create an issue and refactor in a subsequent PR.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue created: #90212

@masseyke masseyke merged commit 2566cd1 into elastic:main Sep 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Health Issues for the health report API >non-issue Team:Data Management (obsolete) DO NOT USE. This team no longer exists. v8.5.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants