More logging for slow cluster state application #45007
DaveCTurner merged 5 commits into elastic:master
Conversation
Today the lag detector may remove nodes from the cluster if they fail to apply a cluster state within a reasonable timeframe, but it is rather unclear from the default logging that this has occurred, and there is very little extra information beyond the fact that the removed node was lagging. Moreover the only forewarning that the lag detector might be invoked is a message indicating that cluster state publication took unreasonably long, which does not contain enough information to investigate the problem further. This commit adds a good deal more detail to make the issues of slow nodes more prominent:

- every 10 seconds (by default) we log a warning indicating that a publication is still waiting for responses from some nodes, including the identities of the problematic nodes.
- the lag detector logs a more detailed warning when a fatally-lagging node is detected.
- if applying a cluster state takes too long then the cluster applier service logs a breakdown of all the tasks it ran as part of that process.
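The first bullet (naming the nodes a publication is still waiting for) comes down to tracking the not-yet-responded set. The following is a simplified sketch of that idea, not Elasticsearch's actual `Publication` class; all names here are illustrative:

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

// Simplified sketch (not Elasticsearch's actual implementation): track which
// nodes have not yet responded to a publication so that a periodic warning
// can name the problematic nodes rather than just saying "slow".
class PublicationProgress {
    private final Set<String> pendingNodes;

    PublicationProgress(Collection<String> targetNodes) {
        this.pendingNodes = new HashSet<>(targetNodes);
    }

    void onNodeResponse(String nodeId) {
        pendingNodes.remove(nodeId);
    }

    // The message a periodic reporter could log while responses are missing.
    Optional<String> incompleteNodesMessage() {
        if (pendingNodes.isEmpty()) {
            return Optional.empty();
        }
        // TreeSet gives a stable, sorted listing of the lagging nodes.
        return Optional.of("publication is still waiting for responses from nodes "
                + new TreeSet<>(pendingNodes));
    }
}
```

A periodic task can then emit this message while it is non-empty and stop once every node has responded.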
Pinging @elastic/es-distributed
The one-line log message from the
```java
        }
    }
    lagDetector.startLagDetector(publishRequest.getAcceptedState().version());
    logIncompleteNodes();
```
Because 10s divides 30s exactly, there is a chance that by default sometimes we'll log the third periodic warning and then the end-of-publication warning in quick succession. We could just not log the end-of-publication warning here.
@DaveCTurner that's not so bad imo, we'll see the thread this was logged from in the logs as well so it's not that weird? (put differently, it doesn't bother me :P and maybe by some random chance this will help debug some locking issue with mutex who knows :D)
Thinking about this made me slightly doubt whether we should do periodic logging here based on an interval. I wonder if we should instead just allow defining two timeouts, one for debug (5s) and one for info logging (10s), and then do warn logging upon publication timeout (i.e. what we have here).
I think an earlier debug log is of limited value, because people don't often run with debug logging on the off-chance that it catches something interesting like this. I do agree that it seems sufficient to emit a single info message after 10 seconds, and then a warning after 30 seconds, so I've implemented that in 16fae08.
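The scheme agreed here (a single INFO report once a publication has been pending for 10 seconds, and a WARN at the 30-second publication timeout) reduces to a small, pure decision function. This is a sketch with illustrative names and the default thresholds from the discussion, not the actual implementation:

```java
// Sketch of the reporting scheme discussed above: one INFO report once the
// publication has been pending for the info threshold, and a WARN at the
// publication timeout. The thresholds are the defaults mentioned in the
// discussion (10s and 30s); all names here are illustrative.
class PublicationReportPolicy {
    static final long INFO_THRESHOLD_MILLIS = 10_000;
    static final long WARN_THRESHOLD_MILLIS = 30_000;

    // Returns the level at which to report a still-pending publication,
    // or null if it has not been pending long enough to mention.
    static String reportLevel(long elapsedMillis) {
        if (elapsedMillis >= WARN_THRESHOLD_MILLIS) {
            return "WARN";
        }
        if (elapsedMillis >= INFO_THRESHOLD_MILLIS) {
            return "INFO";
        }
        return null;
    }
}
```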
original-brownbear left a comment
LGTM, maybe fix the appender stop order but it shouldn't matter with the DeterministicTaskQueue I think.
```java
            defaultMillis(LagDetector.CLUSTER_FOLLOWER_LAG_TIMEOUT_SETTING), "waiting for warning to be emitted");
        mockLogAppenderForLagDetection.assertAllExpectationsMatched();
    } finally {
        mockLogAppenderForLagDetection.stop();
```
I think it doesn't matter here, but in general you should always remove appenders first and then stop them, to avoid an error from trying to use a stopped appender (same comment applies in the other log appender spots).
Ok, thanks, adjusted in c9df59c. Indeed it doesn't matter here, but a stop-before-remove in places with more concurrency is effectively an assertion that no more log messages are being emitted.
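The ordering being discussed (detach the appender from its logger, then stop it) can be modelled with a toy logger/appender pair. This is a hand-rolled sketch to show the failure mode, not the Log4j or `MockLogAppender` API:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not the Log4j / MockLogAppender API) of why remove-then-stop is
// the safe order: a stopped appender rejects events, so it must be detached
// from the logger before any concurrently-emitted message can reach it.
class SketchAppender {
    private boolean stopped = false;
    final List<String> events = new ArrayList<>();

    void append(String message) {
        if (stopped) {
            throw new IllegalStateException("attempt to use a stopped appender");
        }
        events.add(message);
    }

    void stop() {
        stopped = true;
    }
}

class SketchLogger {
    private final List<SketchAppender> appenders = new ArrayList<>();

    void addAppender(SketchAppender appender) { appenders.add(appender); }
    void removeAppender(SketchAppender appender) { appenders.remove(appender); }

    void log(String message) {
        for (SketchAppender appender : appenders) {
            appender.append(message);
        }
    }
}
```

With this model, `removeAppender` followed by `stop` means a late `log` call simply misses the appender, whereas stopping first would make that same call throw; in that sense deliberately stopping before removing acts as an assertion that no more messages arrive.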
@elasticmachine please run elasticsearch-ci/2
ywelsch left a comment
Great stuff. I've left some comments and questions for discussion.
```java
        TimeValue.timeValueMillis(30000), TimeValue.timeValueMillis(1), Setting.Property.NodeScope);

    // the interval between reports of slow publication
    public static final Setting<TimeValue> PUBLISH_REPORT_INTERVAL_SETTING =
```
The setting is not registered. Is that by design or omission?
Omission :(
We really should have build-time checks for this.
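The failure mode behind this exchange can be illustrated with a minimal registry: a setting defined as a constant in code but never handed to the registry is rejected whenever someone tries to configure it. This is a toy model, not the actual `ClusterSettings` class, and the setting keys used below are illustrative:

```java
import java.util.HashSet;
import java.util.Set;

// Toy model (not the real ClusterSettings): a settings registry only accepts
// keys it was told about, so defining a Setting constant in code without also
// registering it leaves that setting silently unusable from configuration.
class SketchSettingsRegistry {
    private final Set<String> registeredKeys = new HashSet<>();

    void register(String key) {
        registeredKeys.add(key);
    }

    void validate(String key) {
        if (!registeredKeys.contains(key)) {
            throw new IllegalArgumentException("unknown setting [" + key + "]");
        }
    }
}
```

A build-time check of the kind wished for above would amount to finding every `Setting` constant and asserting that each one reaches `register`.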
server/src/main/java/org/elasticsearch/cluster/coordination/LagDetector.java
```diff
 protected void warnAboutSlowTaskIfNeeded(TimeValue executionTime, String source) {
     if (executionTime.getMillis() > slowTaskLoggingThreshold.getMillis()) {
-        logger.warn("cluster state update task [{}] took [{}] which is above the warn threshold of {}", source, executionTime,
+        logger.warn("cluster state update task [{}] took [{}] which is above the warn threshold of [{}]", source, executionTime,
```
An interesting follow-up would be to change the slow logging in MasterService so that it only applies to calculateTaskOutputs, and then reduce the timeout from 30s to 10s. Currently, the slow logging in MasterService measures cluster state update calculation, publication and application together; the latter two are now covered by your PR here.
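The follow-up suggestion (scope the slow-task warning to the state computation alone, with a tighter threshold) might look like the sketch below. The name `calculateTaskOutputs` comes from the comment above; everything else is illustrative, not the actual MasterService code:

```java
import java.util.function.Supplier;

// Sketch of the follow-up idea above: time only the cluster state computation
// (calculateTaskOutputs) instead of computation + publication + application,
// so a tighter threshold (10s rather than 30s) can be applied to it alone.
class ComputationTimer {
    static final long COMPUTE_WARN_THRESHOLD_MILLIS = 10_000;

    static <T> T timeComputation(String source, Supplier<T> calculateTaskOutputs) {
        long startNanos = System.nanoTime();
        T outputs = calculateTaskOutputs.get(); // only the computation is timed
        long elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000;
        if (elapsedMillis > COMPUTE_WARN_THRESHOLD_MILLIS) {
            System.err.printf(
                "computing cluster state update for [%s] took [%dms] which is above the warn threshold of [%dms]%n",
                source, elapsedMillis, COMPUTE_WARN_THRESHOLD_MILLIS);
        }
        return outputs;
    }
}
```

Publication and application would then carry their own thresholds, as introduced by this PR, instead of sharing the single 30-second one.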
@elasticmachine please run elasticsearch-ci/2
Today the lag detector may remove nodes from the cluster if they fail to apply a cluster state within a reasonable timeframe, but it is rather unclear from the default logging that this has occurred, and there is very little extra information beyond the fact that the removed node was lagging. Moreover the only forewarning that the lag detector might be invoked is a message indicating that cluster state publication took unreasonably long, which does not contain enough information to investigate the problem further. This commit adds a good deal more detail to make the issues of slow nodes more prominent:

- after 10 seconds (by default) we log an INFO message indicating that a publication is still waiting for responses from some nodes, including the identities of the problematic nodes.
- when the publication times out after 30 seconds (by default) we log a WARN message identifying the nodes that are still pending.
- the lag detector logs a more detailed warning when a fatally-lagging node is detected.
- if applying a cluster state takes too long then the cluster applier service logs a breakdown of all the tasks it ran as part of that process.
Adds a tighter threshold for logging a warning about slowness in the `MasterService` instead of relying on the cluster service's 30-second warning threshold. This new threshold applies to the computation of the cluster state update in isolation, so we get a warning if computing a new cluster state update takes longer than 10 seconds even if it is subsequently applied quickly. It also applies independently to the length of time it takes to notify the cluster state tasks on completion of publication, in case any of these notifications holds up the master thread for too long. Relates elastic#45007
Adds a tighter threshold for logging a warning about slowness in the `MasterService` instead of relying on the cluster service's 30-second warning threshold. This new threshold applies to the computation of the cluster state update in isolation, so we get a warning if computing a new cluster state update takes longer than 10 seconds even if it is subsequently applied quickly. It also applies independently to the length of time it takes to notify the cluster state tasks on completion of publication, in case any of these notifications holds up the master thread for too long. Relates #45007 Backport of #45086