More logging for slow cluster state application #45007
DaveCTurner merged 5 commits into elastic:master
Conversation
Today the lag detector may remove nodes from the cluster if they fail to apply a cluster state within a reasonable timeframe, but it is rather unclear from the default logging that this has occurred, and there is very little extra information beyond the fact that the removed node was lagging. Moreover the only forewarning that the lag detector might be invoked is a message indicating that cluster state publication took unreasonably long, which does not contain enough information to investigate the problem further. This commit adds a good deal more detail to make the issues of slow nodes more prominent:

- every 10 seconds (by default) we log a warning indicating that a publication is still waiting for responses from some nodes, including the identities of the problematic nodes.
- the lag detector logs a more detailed warning when a fatally-lagging node is detected.
- if applying a cluster state takes too long then the cluster applier service logs a breakdown of all the tasks it ran as part of that process.
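The first bullet (naming the nodes a publication is still waiting for) comes down to tracking the not-yet-responded set. The following is a simplified sketch of that idea, not Elasticsearch's actual `Publication` class; all names here are illustrative:

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

// Simplified sketch (not Elasticsearch's actual implementation): track which
// nodes have not yet responded to a publication so that a periodic warning
// can name the problematic nodes rather than just saying "slow".
class PublicationProgress {
    private final Set<String> pendingNodes;

    PublicationProgress(Collection<String> targetNodes) {
        this.pendingNodes = new HashSet<>(targetNodes);
    }

    void onNodeResponse(String nodeId) {
        pendingNodes.remove(nodeId);
    }

    // The message a periodic reporter could log while responses are missing.
    Optional<String> incompleteNodesMessage() {
        if (pendingNodes.isEmpty()) {
            return Optional.empty();
        }
        // TreeSet gives a stable, sorted listing of the lagging nodes.
        return Optional.of("publication is still waiting for responses from nodes "
                + new TreeSet<>(pendingNodes));
    }
}
```

A periodic task can then emit this message while it is non-empty and stop once every node has responded.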
Pinging @elastic/es-distributed
The one-line log message from the
```java
        }
    }
    lagDetector.startLagDetector(publishRequest.getAcceptedState().version());
    logIncompleteNodes();
```
Because 10s divides 30s exactly, there is a chance that by default sometimes we'll log the third periodic warning and then the end-of-publication warning in quick succession. We could just not log the end-of-publication warning here.
@DaveCTurner that's not so bad imo, we'll see the thread this was logged from in the logs as well so it's not that weird? (put differently, it doesn't bother me :P and maybe by some random chance this will help debug some locking issue with mutex who knows :D)
Thinking about this made me slightly doubt whether we should do periodic logging here based on an interval. I wonder if we should instead just allow defining two timeouts, one for debug (5s) and one for info logging (10s), and then do warn logging upon publication timeout (i.e. what we have here).
I think an earlier debug log is of limited value, because people don't often run with debug logging on the off-chance that it catches something interesting like this. I do agree that it seems sufficient to emit a single info message after 10 seconds, and then a warning after 30 seconds, so I've implemented that in 16fae08.
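The scheme agreed here (a single INFO report once a publication has been pending for 10 seconds, and a WARN at the 30-second publication timeout) reduces to a small, pure decision function. This is a sketch with illustrative names and the default thresholds from the discussion, not the actual implementation:

```java
// Sketch of the reporting scheme discussed above: one INFO report once the
// publication has been pending for the info threshold, and a WARN at the
// publication timeout. The thresholds are the defaults mentioned in the
// discussion (10s and 30s); all names here are illustrative.
class PublicationReportPolicy {
    static final long INFO_THRESHOLD_MILLIS = 10_000;
    static final long WARN_THRESHOLD_MILLIS = 30_000;

    // Returns the level at which to report a still-pending publication,
    // or null if it has not been pending long enough to mention.
    static String reportLevel(long elapsedMillis) {
        if (elapsedMillis >= WARN_THRESHOLD_MILLIS) {
            return "WARN";
        }
        if (elapsedMillis >= INFO_THRESHOLD_MILLIS) {
            return "INFO";
        }
        return null;
    }
}
```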
original-brownbear left a comment
LGTM, maybe fix the appender stop order but it shouldn't matter with the DeterministicTaskQueue I think.
```java
            defaultMillis(LagDetector.CLUSTER_FOLLOWER_LAG_TIMEOUT_SETTING), "waiting for warning to be emitted");
        mockLogAppenderForLagDetection.assertAllExpectationsMatched();
    } finally {
        mockLogAppenderForLagDetection.stop();
```
I think it doesn't matter here, but in general you should always remove appenders first and then stop them, to avoid an error from trying to use a stopped appender (same comment applies in the other log appender spots).
Ok, thanks, adjusted in c9df59c. Indeed it doesn't matter here, but a stop-before-remove in places with more concurrency is effectively an assertion that no more log messages are being emitted.
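The ordering being discussed (detach the appender from its logger, then stop it) can be modelled with a toy logger/appender pair. This is a hand-rolled sketch to show the failure mode, not the Log4j or `MockLogAppender` API:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not the Log4j / MockLogAppender API) of why remove-then-stop is
// the safe order: a stopped appender rejects events, so it must be detached
// from the logger before any concurrently-emitted message can reach it.
class SketchAppender {
    private boolean stopped = false;
    final List<String> events = new ArrayList<>();

    void append(String message) {
        if (stopped) {
            throw new IllegalStateException("attempt to use a stopped appender");
        }
        events.add(message);
    }

    void stop() {
        stopped = true;
    }
}

class SketchLogger {
    private final List<SketchAppender> appenders = new ArrayList<>();

    void addAppender(SketchAppender appender) { appenders.add(appender); }
    void removeAppender(SketchAppender appender) { appenders.remove(appender); }

    void log(String message) {
        for (SketchAppender appender : appenders) {
            appender.append(message);
        }
    }
}
```

With this model, `removeAppender` followed by `stop` means a late `log` call simply misses the appender, whereas stopping first would make that same call throw; in that sense deliberately stopping before removing acts as an assertion that no more messages arrive.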
@elasticmachine please run elasticsearch-ci/2
ywelsch left a comment
Great stuff. I've left some comments and questions for discussion.
```java
        TimeValue.timeValueMillis(30000), TimeValue.timeValueMillis(1), Setting.Property.NodeScope);

    // the interval between reports of slow publication
    public static final Setting<TimeValue> PUBLISH_REPORT_INTERVAL_SETTING =
```
The setting is not registered. Is that by design or omission?
Omission :(
We really should have build-time checks for this.
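The failure mode behind this exchange can be illustrated with a minimal registry: a setting defined as a constant in code but never handed to the registry is rejected whenever someone tries to configure it. This is a toy model, not the actual `ClusterSettings` class, and the setting keys used below are illustrative:

```java
import java.util.HashSet;
import java.util.Set;

// Toy model (not the real ClusterSettings): a settings registry only accepts
// keys it was told about, so defining a Setting constant in code without also
// registering it leaves that setting silently unusable from configuration.
class SketchSettingsRegistry {
    private final Set<String> registeredKeys = new HashSet<>();

    void register(String key) {
        registeredKeys.add(key);
    }

    void validate(String key) {
        if (!registeredKeys.contains(key)) {
            throw new IllegalArgumentException("unknown setting [" + key + "]");
        }
    }
}
```

A build-time check of the kind wished for above would amount to finding every `Setting` constant and asserting that each one reaches `register`.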
server/src/main/java/org/elasticsearch/cluster/coordination/LagDetector.java
```diff
 protected void warnAboutSlowTaskIfNeeded(TimeValue executionTime, String source) {
     if (executionTime.getMillis() > slowTaskLoggingThreshold.getMillis()) {
-        logger.warn("cluster state update task [{}] took [{}] which is above the warn threshold of {}", source, executionTime,
+        logger.warn("cluster state update task [{}] took [{}] which is above the warn threshold of [{}]", source, executionTime,
```
An interesting follow-up would be to change the slow logging in MasterService so that it only applies to calculateTaskOutputs, and then reduce the timeout from 30s to 10s. Currently, the slow logging in MasterService measures cluster state update calculation, publication and application together; the latter two are now covered by your PR here.
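The follow-up suggestion (scope the slow-task warning to the state computation alone, with a tighter threshold) might look like the sketch below. The name `calculateTaskOutputs` comes from the comment above; everything else is illustrative, not the actual MasterService code:

```java
import java.util.function.Supplier;

// Sketch of the follow-up idea above: time only the cluster state computation
// (calculateTaskOutputs) instead of computation + publication + application,
// so a tighter threshold (10s rather than 30s) can be applied to it alone.
class ComputationTimer {
    static final long COMPUTE_WARN_THRESHOLD_MILLIS = 10_000;

    static <T> T timeComputation(String source, Supplier<T> calculateTaskOutputs) {
        long startNanos = System.nanoTime();
        T outputs = calculateTaskOutputs.get(); // only the computation is timed
        long elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000;
        if (elapsedMillis > COMPUTE_WARN_THRESHOLD_MILLIS) {
            System.err.printf(
                "computing cluster state update for [%s] took [%dms] which is above the warn threshold of [%dms]%n",
                source, elapsedMillis, COMPUTE_WARN_THRESHOLD_MILLIS);
        }
        return outputs;
    }
}
```

Publication and application would then carry their own thresholds, as introduced by this PR, instead of sharing the single 30-second one.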
@elasticmachine please run elasticsearch-ci/2
Today the lag detector may remove nodes from the cluster if they fail to apply a cluster state within a reasonable timeframe, but it is rather unclear from the default logging that this has occurred, and there is very little extra information beyond the fact that the removed node was lagging. Moreover the only forewarning that the lag detector might be invoked is a message indicating that cluster state publication took unreasonably long, which does not contain enough information to investigate the problem further. This commit adds a good deal more detail to make the issues of slow nodes more prominent:

- after 10 seconds (by default) we log an INFO message indicating that a publication is still waiting for responses from some nodes, including the identities of the problematic nodes.
- when the publication times out after 30 seconds (by default) we log a WARN message identifying the nodes that are still pending.
- the lag detector logs a more detailed warning when a fatally-lagging node is detected.
- if applying a cluster state takes too long then the cluster applier service logs a breakdown of all the tasks it ran as part of that process.
Adds a tighter threshold for logging a warning about slowness in the `MasterService` instead of relying on the cluster service's 30-second warning threshold. This new threshold applies to the computation of the cluster state update in isolation, so we get a warning if computing a new cluster state update takes longer than 10 seconds even if it is subsequently applied quickly. It also applies independently to the length of time it takes to notify the cluster state tasks on completion of publication, in case any of these notifications holds up the master thread for too long. Relates elastic#45007
Adds a tighter threshold for logging a warning about slowness in the `MasterService` instead of relying on the cluster service's 30-second warning threshold. This new threshold applies to the computation of the cluster state update in isolation, so we get a warning if computing a new cluster state update takes longer than 10 seconds even if it is subsequently applied quickly. It also applies independently to the length of time it takes to notify the cluster state tasks on completion of publication, in case any of these notifications holds up the master thread for too long. Relates #45007 Backport of #45086