Adding DiskHealthIndicatorService #90041
Conversation
…, and combining index blocks with red indices
andreidan left a comment:
Thanks Keith, left some suggestions
```java
    .flatMap(Collection::stream)
    .collect(Collectors.toSet());
if (roles.contains(DiscoveryNodeRole.MASTER_ROLE)) {
    impacts.addAll(StableMasterHealthIndicatorService.UNSTABLE_MASTER_IMPACTS);
```
---
Do we want to report all these? I think one impact that summarises the impact on the master node would be enough? If we decide to have the same impacts (all from the stable master indicator) I think we shouldn't create this coupling between indicators but redefine the impacts in this indicator.
What do you think? (@gmarouli)
---
I agree, so I would expect that if the disk is actually causing the master to be unstable, then the unstable-master indicator will elaborately describe the impact. Here we "assume" that a master running out of disk will cause issues, but we do not really know. I would suggest that just saying the master is running out of disk, and that this could have a negative impact on the following areas, is enough. What do you think?
---
What are we going to do if the master and data nodes are fine but the ingest node or ml node are full? We wouldn't report the master node impacts in that case right? But we also wouldn't want to just ignore those would we?
---
> ingest node or ml node are full? We wouldn't report the master node impacts in that case right?

That's correct, we would not return the master node impacts (as the master node is fine w.r.t. disk).
> But we also wouldn't want to just ignore those would we?

I assume "those" here are the possible impacts of ingest/ML nodes being full? If so, yes, we should NOT ignore them but report ingest/ML related impacts.
If "those" meant the "master related impacts", I think we should ignore them as the master nodes are fine.
If the master nodes have issues with disk space we should report that the master functions might be impacted
ie. The master nodes' ability to create, delete, or rebalance indices, and insert or update documents might be affected
```java
if (roles.contains(DiscoveryNodeRole.TRANSFORM_ROLE)) {
    impacts.add(HEALTH_TRANSFORM_NODES_IMPACT);
}
diagnosisList = getDiskProblemsDiagnosis(explain, problemNodes, problemIndices);
```
---
I believe the diagnosis will be based on the node role right?
Or is the plan to have one diagnosis and mention the various roles in the troubleshooting guide?
IMO we should have a diagnosis for data nodes (with its own id and troubleshooting guide), one for master nodes, etc
What do you think? (@gmarouli )
Update: Maybe having one troubleshooting guide that contains different instructions per role is a good way forward, but I think the messaging in the diagnosis cause and action should be telling with regards to the affected nodes and roles (similar to how we do it in the shards_availability indicator: https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java#L186 )
---
Also, is it time we add a type to the affected_resource and report the problematic node_ids there?
We currently return different things - policy names in the SLM indicator, indices in shards_availability.
What do you think?
---
I think this requires some thinking. I am pretty confident that we need a distinction between nodes that contain data and nodes that don't, because in the on-prem case adding nodes is not going to fix anything.
We could start with this separation and see how it forms, what do you think?
About the affected resources, I agree, I think it's time. For now I would stick with indices and nodes, but I think we should maybe also abstract data streams, tiers, etc. Any thoughts?
---
> Or is the plan to have one diagnosis and mention the various roles in the troubleshooting guide?

Based on the spreadsheet we've been working from I had been thinking one diagnosis for all roles (since in all cases I think it's "find the node and clean up space or add disk") but different impacts for each role (since that will truly be different). We can make different diagnoses for each role but we'll need to work out what those would be.
---
> I am pretty confident that we need a distinction between nodes that contain data and nodes that don't, because in the on-prem case adding nodes is not going to fix anything.

That's the one thing we do have here right? If a node has an index, we give a data-specific diagnosis and impact. If the problematic nodes don't have index data then we don't recommend adding nodes. Or am I misunderstanding?
---
I pushed that code into a separate branch so that we could still talk about it but not have it in this PR: #90077. I'm not tied to that at all if someone wants to take it in a different direction, but wanted to give an idea of the scope of the change.
---
> Based on the spreadsheet we've been working from I had been thinking one diagnosis for all roles (since in all cases I think it's "find the node and clean up space or add disk")

@masseyke I think the solutions will be different. ie. enable autoscaling for data nodes, add capacity for master nodes
---
> I pushed that code into a separate branch so that we could still talk about it but not have it in this PR: #90077. I'm not tied to that at all if someone wants to take it in a different direction, but wanted to give an idea of the scope of the change.

Thanks Keith. I think we'll want something along the lines of:
```json
"affected_resources": [
  { "type": "index", "resources": [ "index1", "index2", … ] },
  { "type": "node", "resources": [ "node-1", "node-2", … ] }
]
```

or

```json
"affected_resources": {
  "type": "index", "resources": [ "index1", "index2", … ]
}
```

But let's have that discussion in a different medium.
For the scope of this PR shall we report the node ids in the affected_resources ?
---
> For the scope of this PR shall we report the node ids in the affected_resources ?

Yeah, that's what I'm proposing.
---
> @masseyke I think the solutions will be different. ie. enable autoscaling for data nodes, add capacity for master nodes

And that's in line with what we have here right? Different diagnoses for data nodes vs other nodes. I think I'm missing something.
```java
        healthStatus
    );
} else {
    healthIndicatorResult = getResultForNonDataNodeProblem(rolesOnRedNodes, nodesReportingRed, details, healthStatus);
```
---
I think that if non-data nodes are having issues we should still report them, because then we need to add the diagnosis to add capacity. Right?
---
Oh, I thought what we had agreed on with @andreidan was to have a hierarchy:
- If there are data node problems, show only data node problems.
- Else, if there are master node problems, show only master node problems.
- Else, show whatever other nodes have problems.
---
Maybe I am wrong; I thought we would include it as a symptom, yes, but also add it to the impact and diagnosis. But I am not sure.
---
@masseyke we should definitely return all impacts and diagnoses. The grouping is to help with which nodes are reported as part of which impact and diagnosis. To reiterate:
- if it's a data node - report data node diagnosis (autoscaling)
- if it's a dedicated master node - report the master node diagnosis
- if it's a dedicated other node (ML, transform) - report (generic) diagnosis for this kind of node
When it comes to impacts, all affected node roles should be reflected in the impact
If we have nodes in all categories, report them all, each in its corresponding category
---
We talked offline. We're doing a modification of @andreidan 's comment above -- if a node has any role we'll return the impact and diagnosis for that role.
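As a rough illustration of that agreed behaviour, every role an unhealthy node holds gets that role's impact and diagnosis, so each role bucket ends up with the node ids it covers. The class and method names below are hypothetical sketches, not the PR's actual API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: a node with any role contributes to the
// impact/diagnosis bucket for each of its roles. Names are illustrative,
// not those of the actual DiskHealthIndicatorService.
class RoleBuckets {
    // Groups unhealthy node ids under each role they hold, so every role
    // gets its own impact and diagnosis covering the relevant nodes.
    static Map<String, Set<String>> nodesByRole(Map<String, Set<String>> unhealthyNodeRoles) {
        Map<String, Set<String>> buckets = new HashMap<>();
        unhealthyNodeRoles.forEach((nodeId, roles) -> {
            for (String role : roles) {
                buckets.computeIfAbsent(role, r -> new TreeSet<>()).add(nodeId);
            }
        });
        return buckets;
    }
}
```

Note that a node with both the `data` and `master` roles would appear in both buckets, matching "if a node has any role we'll return the impact and diagnosis for that role".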
Pinging @elastic/es-data-management (Team:Data Management)

I adjusted the details here: masseyke#8
```java
}

@Override
public HealthIndicatorResult calculate(boolean explain, HealthInfo healthInfo) {
```
---
I find this method difficult to read and reason about.
I am okay if we keep it like this for now, but I would like to go over it one more time and either restructure it a bit or add comments. I think it will be really difficult for someone who hasn't worked on this as extensively as we have to maintain it in the future.
If you agree I will create a GitHub issue. Let me know if it's just me.
---
I find it difficult to reason about it too.
Hopefully, if we update the behaviour, the design will crystallise too ( #90041 (comment) ).
---
I think you both found it difficult to follow because we had a miscommunication about the expected behavior. I had thought @andreidan only wanted information about the highest-priority problem (I thought we'd gone through that logic offline). I've changed it now to return information about all problems on all node types (data, master, or other) at all levels (red or yellow).
---
Thanks for working on this Keith.
I left a few minor comments but generally, I think we're not reporting some node types in certain conditions. Hopefully this clarifies things a bit.
Also, can you update the description of this PR with a brief up-to-date definition of the indicator? (ie. the conditions it's reporting on)
```java
impacts.add(
    new HealthIndicatorImpact(
        1,
        "At risk of not being able to insert or update documents in the affected indices.",
```
---
I think this sentence is a bit confusing.
It is missing a subject. What is at risk?
Let's update this in a subsequent PR (create an issue for this)
```java
}
if (redNonDataNonMasterNodes.isEmpty() == false || yellowNonDataNonMasterNodes.isEmpty() == false) {
    impacts.add(
        new HealthIndicatorImpact(2, "Some cluster functionality might be unavailable.", List.of(ImpactArea.DEPLOYMENT_MANAGEMENT))
```
---
I think the node roles that are affected should define the functionality that might be impaired (ie. ML, transform, ingest pipelines).
Let's fix this in a subsequent PR (create an issue for this)
```java
Set<String> indicesWithBlock,
Set<String> indicesOnRedNodes,
Set<String> indicesOnYellowNodes,
Set<String> nodesWithBlockedIndices,
Set<String> redDataNodes,
Set<String> yellowDataNodes,
Set<String> redMasterNodes,
Set<String> yellowMasterNodes,
Set<String> redNonDataNonMasterNodes,
Set<String> yellowNonDataNonMasterNodes
```
---
I think we should group these in a couple of records (similar to what we do in https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java#L349 )
Let's create an issue and refactor in a subsequent PR.
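As a purely illustrative sketch of that suggestion (the record and field names below are invented, not from the PR), the ten `Set<String>` parameters could be grouped along these lines:

```java
import java.util.Set;

// Hypothetical grouping of the parameters into records, as suggested above:
// one record pairs the red/yellow node sets for a role category, and a wrapper
// record carries the index-related sets alongside the per-category node sets.
record NodesByStatus(Set<String> red, Set<String> yellow) {}

record DiskHealthProblems(
    Set<String> indicesWithBlock,
    Set<String> indicesOnRedNodes,
    Set<String> indicesOnYellowNodes,
    Set<String> nodesWithBlockedIndices,
    NodesByStatus dataNodes,
    NodesByStatus masterNodes,
    NodesByStatus otherNodes
) {}
```

This would shrink the method signature to a single parameter and make call sites harder to get wrong (no ten positional sets of the same type).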
This PR adds a new disk health indicator. The output looks something like this:

There are 3 classes of problems in this indicator: (1) an index block / data node problem, (2) a node with the `master` role is filling up, and (3) any other node (besides master or data) is filling up. There will be at most 3 impacts and 3 diagnoses -- one for each of these classes of problems, regardless of whether the status of any node is red or yellow.

For example, say we have a 5-node cluster. Say we have an index that has a block on it in the cluster state (index3), and that index lives on node3 and node4. Let's say that even though the block is still in the cluster state, node3 and node4 have fallen below the flood stage, so they return YELLOW. And node5 is over the flood limit, so it returns RED. Then we'd expect the overall status to be RED (both because of the blocked index and the RED node). We'd expect the `symptom` to list problems with nodes 3, 4, and 5. And we'd expect two impacts and two diagnoses -- one each for the index / data node problem and the master node problem. The result might look like this:
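(The example result itself is not reproduced here. As a purely hypothetical illustration of the shape described above -- two impacts and two diagnoses, with field names and wording invented for this sketch rather than taken from the indicator's real output -- it might look roughly like:)

```json
{
  "status": "red",
  "symptom": "Insufficient disk space on nodes [node3, node4, node5].",
  "impacts": [
    { "severity": 1, "description": "Cannot insert or update documents in the affected indices." },
    { "severity": 2, "description": "Master node functionality might be impaired because master nodes are low on disk space." }
  ],
  "diagnosis": [
    { "cause": "Data nodes are low on disk space.", "action": "Free up disk space or add capacity to the affected data nodes." },
    { "cause": "Master nodes are low on disk space.", "action": "Free up disk space on the affected master nodes." }
  ]
}
```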