Increase PeerFinder verbosity on persistent failure #73128
Merged
DaveCTurner merged 4 commits into elastic:master on May 17, 2021
If a node is partitioned away from the rest of the cluster then the `ClusterFormationFailureHelper` periodically reports that it cannot discover the expected collection of nodes, but does not indicate why. To prove it's a connectivity problem, users must today restart the node with `DEBUG` logging on `org.elasticsearch.discovery.PeerFinder` to see further details. With this commit we log messages at `WARN` level if the node remains disconnected for longer than a configurable timeout, which defaults to 5 minutes. Relates elastic#72968
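The escalation described above can be illustrated with a minimal sketch. This is not the actual `PeerFinder` code: the class `DiscoveryLogEscalation`, its methods, and the `Level` enum are hypothetical names chosen for illustration, and the only detail taken from the description is that failures are reported at `DEBUG` until the node has stayed disconnected for longer than a configurable timeout (5 minutes by default), after which they are reported at `WARN`.

```java
import java.time.Duration;

/** Hypothetical helper illustrating the idea; not the real PeerFinder implementation. */
final class DiscoveryLogEscalation {
    enum Level { DEBUG, WARN }

    private final Duration warnAfter;      // assumed configurable timeout, e.g. 5 minutes
    private long failingSinceMillis = -1;  // -1 means "not currently failing"

    DiscoveryLogEscalation(Duration warnAfter) {
        this.warnAfter = warnAfter;
    }

    /** Call when discovery succeeds; resets the failure clock. */
    void onConnected() {
        failingSinceMillis = -1;
    }

    /** Returns the level at which a discovery failure observed at nowMillis should be logged. */
    Level levelForFailure(long nowMillis) {
        if (failingSinceMillis < 0) {
            failingSinceMillis = nowMillis; // first failure since the last success
        }
        long elapsedMillis = nowMillis - failingSinceMillis;
        // Escalate only once the node has been disconnected for longer than the timeout.
        return elapsedMillis > warnAfter.toMillis() ? Level.WARN : Level.DEBUG;
    }
}
```

Passing the current time in as an argument, rather than reading a clock inside the helper, keeps the sketch deterministic and easy to test.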
Collaborator
Pinging @elastic/es-distributed (Team:Distributed)
henningandersen
approved these changes
May 17, 2021
Contributor
henningandersen
left a comment
LGTM, thanks for improving this.
    "connection failed",
    "org.elasticsearch.discovery.PeerFinder",
    Level.DEBUG,
    "*connection failed*"));
Contributor
nit: now that we validate the message, it would be nice to show that it contains both the transport address and the exception message `cannot connect to`.
    "connection failed",
    "org.elasticsearch.discovery.PeerFinder",
    Level.WARN,
    "*connection failed: cannot connect to*"));
Contributor
nit: now that we validate the message, it would be nice to show that it contains the transport address.
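As a rough illustration of what the two review comments ask for, here is a sketch of `*`-wildcard matching against a log message, where the expected pattern also pins the transport address. It is not the Elasticsearch test utility used in this PR: `WildcardMatch` is a hypothetical helper, and the sample message in the usage note is an assumption modeled on the log format quoted elsewhere in this conversation.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: matches a message against a pattern in which '*' means "any text". */
final class WildcardMatch {
    static boolean matches(String pattern, String message) {
        StringBuilder regex = new StringBuilder();
        boolean first = true;
        // The -1 limit keeps leading and trailing empty segments, so "*foo*" works as expected.
        for (String literal : pattern.split("\\*", -1)) {
            if (!first) {
                regex.append(".*"); // each '*' in the pattern becomes "any text"
            }
            regex.append(Pattern.quote(literal)); // everything else is matched literally
            first = false;
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(message).matches();
    }
}
```

With this helper, a pattern such as `*address [10.0.0.1:12345]*connection failed*cannot connect to*` would only match a message that mentions the address as well as the failure text, whereas `*connection failed*` alone would accept any address.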
Member
Author
Thanks Henning
DaveCTurner
added a commit
that referenced
this pull request
May 17, 2021
If a node is partitioned away from the rest of the cluster then the `ClusterFormationFailureHelper` periodically reports that it cannot discover the expected collection of nodes, but does not indicate why. To prove it's a connectivity problem, users must today restart the node with `DEBUG` logging on `org.elasticsearch.discovery.PeerFinder` to see further details. With this commit we log messages at `WARN` level if the node remains disconnected for longer than a configurable timeout, which defaults to 5 minutes. Relates #72968
DaveCTurner
added a commit
to DaveCTurner/elasticsearch
that referenced
this pull request
Jan 27, 2022
Since elastic#73128 a sufficiently old `PeerFinder` will report all exceptions encountered during discovery to help diagnose cluster formation problems. We throw exceptions on genuine connection failures, but we also throw exceptions if the discovered node is the local node or is master-ineligible because these nodes are of no use in discovery. We report all such exceptions as failures:

    [instance-0000000001] address [10.0.0.1:12345], node [null], requesting [false] connection failed: [instance-0000000002][10.0.0.1:12345] non-master-eligible node found

Experience shows that users often have master-ineligible nodes in their discovery config, so they will see these messages frequently if the cluster cannot form, and may interpret the `connection failed` as the source of the problems even though the messages are benign.

This commit adjusts the language in these messages to be more balanced, replacing `connection failed` with `discovery result`, including the phrase `successfully discovered` in the exception message, and giving advice on how to suppress the message.
DaveCTurner
added a commit
that referenced
this pull request
Jan 28, 2022
Since #73128 a sufficiently old `PeerFinder` will report all exceptions encountered during discovery to help diagnose cluster formation problems. We throw exceptions on genuine connection failures, but we also throw exceptions if the discovered node is the local node or is master-ineligible because these nodes are of no use in discovery. We report all such exceptions as failures:

    [instance-0000000001] address [10.0.0.1:12345], node [null], requesting [false] connection failed: [instance-0000000002][10.0.0.1:12345] non-master-eligible node found

Experience shows that users often have master-ineligible nodes in their discovery config, so they will see these messages frequently if the cluster cannot form, and may interpret the `connection failed` as the source of the problems even though the messages are benign.

This commit adjusts the language in these messages to be more balanced, replacing `connection failed` with `discovery result`, including the phrase `successfully discovered` in the exception message, and giving advice on how to suppress the message.