Disk Usage health indicator #84811
Closed
Labels
:Distributed/Health, >feature, Team:Data Management (obsolete)
Description
Create a disk usage indicator that reports to the user when their cluster is running out of space and the impact this has on its functioning. We propose the following health statuses and their interpretations:
| Status | Meaning | Implementation |
|---|---|---|
| RED | The disk is running out of space on at least one node, or writes are blocked because of limited disk space. | At least one node is above the flood stage watermark, or at least one index is blocked by READ_ONLY_ALLOW_DELETE_BLOCK. |
| YELLOW | There is increased disk usage on at least one node. | At least one data node is above the high watermark with no relocating shards, or a non-data node is above the high watermark.* |
| GREEN | All good, nothing Elasticsearch cannot handle :). | None of the above apply. |
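The table above can be read as a small decision function. The following is a minimal sketch of that reading, not the actual indicator implementation; the class `DiskIndicatorSketch`, the `NodeDisk` type, and all field names are hypothetical and only mirror the conditions in the Implementation column.

```java
import java.util.List;

// Hypothetical sketch of the status table; not Elasticsearch's real classes.
final class DiskIndicatorSketch {

    enum Status { GREEN, YELLOW, RED }

    // The per-node facts that the table's conditions refer to (assumed shape).
    static final class NodeDisk {
        final boolean dataNode;
        final boolean aboveHighWatermark;
        final boolean aboveFloodStage;
        final boolean hasRelocatingShards;

        NodeDisk(boolean dataNode, boolean aboveHighWatermark,
                 boolean aboveFloodStage, boolean hasRelocatingShards) {
            this.dataNode = dataNode;
            this.aboveHighWatermark = aboveHighWatermark;
            this.aboveFloodStage = aboveFloodStage;
            this.hasRelocatingShards = hasRelocatingShards;
        }
    }

    static Status compute(List<NodeDisk> nodes, boolean anyIndexReadOnlyAllowDelete) {
        // RED: an index is blocked by READ_ONLY_ALLOW_DELETE_BLOCK ...
        if (anyIndexReadOnlyAllowDelete) {
            return Status.RED;
        }
        boolean yellow = false;
        for (NodeDisk node : nodes) {
            // ... or any node is above the flood stage watermark.
            if (node.aboveFloodStage) {
                return Status.RED;
            }
            // YELLOW: a non-data node above the high watermark, or a data node
            // above it that has no relocating shards.
            if (node.aboveHighWatermark && (!node.dataNode || !node.hasRelocatingShards)) {
                yellow = true;
            }
        }
        return yellow ? Status.YELLOW : Status.GREEN;
    }
}
```

Under this reading, a data node above the high watermark that still has relocating shards stays GREEN, on the assumption that the relocation is already reducing its disk usage.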
Implementation details
The data should be collected using the persistent tasks framework.
Nodes will listen to cluster state changes for the allocation of the "health persistent task" and push their initial status. After initialization, the nodes will only push changes to their state (e.g., when they change from RED to YELLOW).
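The push-on-allocation, then push-on-change behaviour could look roughly like the sketch below. Everything here is hypothetical: `ChangeOnlyReporter`, its method names, and the `Consumer` that stands in for the transport call to the node hosting the health task.

```java
import java.util.function.Consumer;

// Hypothetical sketch of a node-side reporter; not Elasticsearch's real classes.
final class ChangeOnlyReporter {

    enum Status { GREEN, YELLOW, RED }

    // Stands in for sending the status to the node that hosts the health task.
    private final Consumer<Status> sender;
    // Last status that was pushed; null until the first push.
    private Status lastSent;

    ChangeOnlyReporter(Consumer<Status> sender) {
        this.sender = sender;
    }

    // Called when a cluster state change shows the health task was (re)allocated:
    // always push, because the new task has no record of this node yet.
    void onHealthTaskAllocated(Status current) {
        lastSent = current;
        sender.accept(current);
    }

    // Called after each local check: push only if the status changed.
    void onLocalCheck(Status current) {
        if (current != lastSent) {
            lastSent = current;
            sender.accept(current);
        }
    }
}
```

Re-sending on every allocation is a design assumption in this sketch: it lets a freshly allocated task rebuild its view without having to ask each node.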
The allocated persistent task should be prepared to delay an initial health request if it arrives before the task has had a chance to receive the statuses from the nodes.
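One way to delay early requests is to gate them on a future that completes once every expected node has reported. The sketch below assumes the set of expected node IDs is known up front and uses plain strings for statuses; `HealthInfoCacheSketch` and its methods are hypothetical names, and a real implementation would likely also need a timeout and handling for nodes that join or leave.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the health task's cache; not Elasticsearch's real classes.
final class HealthInfoCacheSketch {

    private final Set<String> expectedNodes;
    private final Map<String, String> statusByNode = new ConcurrentHashMap<>();
    // Completes once every expected node has pushed its initial status.
    private final CompletableFuture<Void> initialized = new CompletableFuture<>();

    HealthInfoCacheSketch(Set<String> expectedNodes) {
        this.expectedNodes = Set.copyOf(expectedNodes);
        if (this.expectedNodes.isEmpty()) {
            initialized.complete(null);
        }
    }

    // Called when a node pushes its status (initial push or later change).
    void onNodeStatus(String nodeId, String status) {
        statusByNode.put(nodeId, status);
        if (statusByNode.keySet().containsAll(expectedNodes)) {
            initialized.complete(null); // no-op if already completed
        }
    }

    // A request that arrives early gets a future that resolves only after
    // initialization; later requests resolve immediately.
    CompletableFuture<Map<String, String>> fetch() {
        return initialized.thenApply(ignored -> Map.copyOf(statusByNode));
    }
}
```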
- Introduce the persistent task (Persistent health task #86131)
- Propagate disk usage thresholds and watermarks to all nodes (Add disk thresholds in the cluster state #88175)
- Introduce thresholds for non-data nodes (parked for now; we want to see if reusing the flood stage and high watermarks is good enough).
- Monitor a node's disk usage health (Health API - Monitoring local disk health #88390)
- [Health node] Cache each node's disk usage health (Health info overview #89275)
- The coordinating node retrieves the health info from the health node (Fetch health info action #89820, Updating HealthService to use FetchHealthInfoCacheAction #89947)
- Use the retrieved disk usage health info and the blocked indices from the cluster state (if they exist) to compute the indicator (Adding DiskHealthIndicatorService #90041)
- Remove the feature flag (Enable the health node and the disk health indicator #84811 #90085)
- Write troubleshooting doc & document the new settings (Disk indicator troubleshooting #90504)
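To illustrate the "monitor a node's disk usage health" step in the list above, here is a minimal sketch of a local check that compares used disk space with the high and flood stage watermarks. It assumes both watermarks are expressed as used-space percentages (Elasticsearch also accepts absolute byte values, which this sketch does not handle), and `LocalDiskCheckSketch` and its methods are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of a local disk check; not Elasticsearch's real classes.
final class LocalDiskCheckSketch {

    enum Status { GREEN, YELLOW, RED }

    // Pure classification: thresholds are used-space percentages (assumption).
    static Status classify(long totalBytes, long usableBytes,
                           double highWatermarkPct, double floodStagePct) {
        if (totalBytes <= 0) {
            throw new IllegalArgumentException("totalBytes must be positive");
        }
        double usedPct = 100.0 * (totalBytes - usableBytes) / totalBytes;
        if (usedPct > floodStagePct) {
            return Status.RED;
        }
        if (usedPct > highWatermarkPct) {
            return Status.YELLOW;
        }
        return Status.GREEN;
    }

    // Reads the file store that backs the given data path and classifies it.
    static Status check(Path dataPath, double highWatermarkPct, double floodStagePct)
            throws IOException {
        FileStore store = Files.getFileStore(dataPath);
        return classify(store.getTotalSpace(), store.getUsableSpace(),
                highWatermarkPct, floodStagePct);
    }
}
```

For example, `LocalDiskCheckSketch.check(Path.of("."), 90.0, 95.0)` classifies the current directory's file store against the default values of `cluster.routing.allocation.disk.watermark.high` and `cluster.routing.allocation.disk.watermark.flood_stage`. The relocating-shards condition for data nodes is left out here because it depends on the cluster state rather than on the local disk.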