Health API - Monitoring local disk health#88390
Health API - Monitoring local disk health#88390elasticsearchmachine merged 28 commits intoelastic:mainfrom
Conversation
5013389 to
d636893
Compare
|
@elasticmachine update branch |
|
@elasticmachine update branch |
|
Pinging @elastic/es-data-management (Team:Data Management) |
andreidan
left a comment
There was a problem hiding this comment.
Thanks for working on this Mary
I left a first set of comments (and a suggestion to break this PR up)
server/src/main/java/org/elasticsearch/health/node/LocalHealthMonitor.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/health/node/LocalHealthMonitor.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/health/node/LocalHealthMonitor.java
Outdated
Show resolved
Hide resolved
| if (healthMetadataInitialized == false) { | ||
| healthMetadataInitialized = HealthMetadata.getFromClusterState(event.state()) != null; | ||
| if (healthMetadataInitialized) { | ||
| scheduleNextRunIfEnabled(TimeValue.timeValueMillis(1)); |
There was a problem hiding this comment.
I think this should schedule based the monitoring based on the normal interval
There was a problem hiding this comment.
hm, you mean that if it gets rescheduled it should have the initial delay?
I have to admit that when I see something executed at an interval I do not expect to have an initial delay. I expect that it's going to be executed asap and then wait for the interval before the next execution. For example let's say that the interval is 10 minutes. I would think it's wrong to wait for 10 minutes before running it for the first time.
What do you think?
server/src/main/java/org/elasticsearch/health/node/LocalHealthMonitor.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/health/node/LocalHealthMonitor.java
Outdated
Show resolved
Hide resolved
|
Update:
|
|
Investigating the test failure. I can reproduce the failure locally too with this seed: |
|
Test fixed. The issue was that we removed the |
|
@elasticmachine update branch |
|
@elasticmachine run elasticsearch-ci/packaging-tests-unix-sample |
andreidan
left a comment
There was a problem hiding this comment.
Thanks for iterating on this Mary
Left a few more suggestions
server/src/main/java/org/elasticsearch/common/settings/ClusterSettings.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/health/node/LocalHealthMonitor.java
Outdated
Show resolved
Hide resolved
| if (inProgress.compareAndSet(false, true)) { | ||
| ClusterState clusterState = clusterService.state(); | ||
| HealthMetadata healthMetadata = HealthMetadata.getFromClusterState(clusterState); | ||
| assert healthMetadata != null : "health metadata should have been initialized."; |
There was a problem hiding this comment.
is this assertion fair?
The master node might've not initialised the health metadata but the monitor might be running because the service is enabled ?
There was a problem hiding this comment.
Before we schedule any monitoring task we check the flag healthMetadataInitialized so theoretically this should hold true. The only case I can think of, is if someone removes the health metadata. But I think we do not give that option programmatically. Would you prefer to have an if here?
There was a problem hiding this comment.
Ah, right. The HealthMetadataService is controlled by a different setting - enabled/disabled via health.node.enabled so that made me think things can diverge.
I prefer the explicit checks, but it's a personal preference so feel free to ignore
There was a problem hiding this comment.
They are all controlled by the same flag health.node.enabled. The reason I have done this, is that if the health node is disabled, there is no point to monitor the disk health locally, or to publish health metadata right?
On a second thought, depending on the use case @jbaiera is working on we might want to decouple this and have the health metadata always be published. They might be useful for the coordinator node too.
server/src/main/java/org/elasticsearch/health/node/LocalHealthMonitor.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/health/node/LocalHealthMonitor.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/health/node/LocalHealthMonitor.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/health/node/IndividualNodeHealth.java
Outdated
Show resolved
Hide resolved
| // Wait until every node in the cluster is upgraded to 8.4.0 or later | ||
| if (event.state().nodesIfRecovered().getMinNodeVersion().onOrAfter(Version.V_8_4_0)) { | ||
| // Wait until the health metadata is available in the cluster state | ||
| if (healthMetadataInitialized == false) { |
There was a problem hiding this comment.
This flag is a little misleading as it waits for a bit more than just the metadata to be initialised (ie. nodes versions). Could we rename it to reflect that?
Also, if the service is enabled,disabled, and enabled again we might end up with 2 scheduled monitoring tracks? (since setEnabled doesn't check if we're already running?)
There was a problem hiding this comment.
About the scheduling two monitoring tasks. That's true, but then one will see that the inProgress flag is not false and it will just exit without doing anything or scheduling anything. I thought it was good trade-off compared to locking. What do you think?
There was a problem hiding this comment.
Yes ! That's a good trade-off 🚀 Can you please document it?
There was a problem hiding this comment.
Ah yes! Good point.
There was a problem hiding this comment.
Done, let me know if it's clear.
There was a problem hiding this comment.
Also renamed healthMetadataInitialized to something more inclusive.
server/src/main/java/org/elasticsearch/health/node/selection/HealthNode.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/health/node/LocalHealthMonitor.java
Outdated
Show resolved
Hide resolved
andreidan
left a comment
There was a problem hiding this comment.
Thanks for iterating on this.
Left a could more comments
server/src/main/java/org/elasticsearch/health/node/LocalHealthMonitor.java
Show resolved
Hide resolved
| // monitoring tasks scheduled, one of them will be no-op. | ||
| private final AtomicBoolean inProgress = new AtomicBoolean(); | ||
| // Keeps the latest health state that was successfully reported. | ||
| private DiskHealthInfo lastReportedDiskHealthInfo = null; |
There was a problem hiding this comment.
should this be volatile to ensure it's written to main memory? (being written from another thread, without volatile it might not be immediately visible as it'll not be flushed to main memory ie. getLastReportedDiskHealthInfo might not see the "last reported" info)
There was a problem hiding this comment.
Yeap, I probably removed it accidentally at some point.
| if (prerequisitesFulfilled == false) { | ||
| prerequisitesFulfilled = event.state().nodesIfRecovered().getMinNodeVersion().onOrAfter(Version.V_8_5_0) | ||
| && HealthMetadata.getFromClusterState(event.state()) != null; | ||
| scheduleNowIfEnabled(); | ||
| } |
There was a problem hiding this comment.
this seems to schedule the polling even if the prerequisites were not fulfilled?
could the version check be still 8_4_0 as we have the metadata already? or is there a reason we should wait for 8.5?
There was a problem hiding this comment.
Hm, I think I need to rename this method. We check the prerequisites in the method scheduleNowIfEnabled. This was a choice I made to ensure the prerequisites are always checked before any scheduling. For this reason it felt redundant to check it again here. What do you think about renaming to maybeScheduleNow() indicating that it is not a given that it will schedule it?
There was a problem hiding this comment.
About 8.5.0 you are right the healthMetadata do exists in 8.4.0 but this local monitor will be pushing changes to the HealthNode and that endpoint is not implemented in 8.4.0. That's why I thought that the minimum required version should be 8.5.0. Does this make sense?
andreidan
left a comment
There was a problem hiding this comment.
LGTM thanks for iterating on this Mary 🚀
|
@elasticmachine update branch |
|
@elasticmachine run elasticsearch-ci/part-1 |
|
@andreidan thanks for the feedback! |
This PR introduces the local health monitoring functionality needed for #84811 . The monitor uses the
NodeServiceto get the disk usage stats and determines the node's disk health.When a change in the disk's is detected or when the health node changes, this class would be responsible to send the node's health to the health node. Currently this is simulated with a method that just logs the current health.
The monitor keeps the last reported health, this way, if something fails on the next check it will try to resend the new health state.