Skip to content

Do not reply to pings from another cluster#21894

Merged
jasontedor merged 1 commit intoelastic:masterfrom
jasontedor:unicast-zen-ping-race
Nov 30, 2016
Merged

Do not reply to pings from another cluster#21894
jasontedor merged 1 commit intoelastic:masterfrom
jasontedor:unicast-zen-ping-race

Conversation

@jasontedor
Copy link
Copy Markdown
Member

@jasontedor jasontedor commented Nov 30, 2016

Today when sending responses to discovery pings, we unconditionally
reply. Instead, this commit modifies the response handler to not reply
when the cluster names do not match.

This addresses a race condition identified after reducing the timeout in
UnicastZenPingTests#testSimplePings. In particular, we send pings in the
following way:

  • if not connected to the node, connect to the node and after
    successful handshake, send a ping
  • if connected to the node, send a ping

When the ping timeout is set low, a subsequent batch of pings can race
against a connect/disconnect cycle from a prior batch of pings. In
particular, consider the following scenario:

  • node A from cluster X
  • node B from cluster Y
  • pings are initiated from node A with node B in the hosts list
  • node A will try to connect and handshake with B
  • the connection will succeed, and the handshake will eventually fail due to mismatched cluster names
  • on a short timeout, a second batch of pings will fire, and on this
    batch node A will see that it is still connected to node B; thus, it
    will immediately fire a ping to node B and node B will dutifully
    respond

Relates #21874

@jasontedor jasontedor added :Distributed/Discovery-Plugins Anything related to our integration plugins with EC2, GCP and Azure review v5.2.0 v6.0.0-alpha1 labels Nov 30, 2016
@jasontedor jasontedor force-pushed the unicast-zen-ping-race branch from c89e6c3 to 8ecd370 Compare November 30, 2016 19:11
Today when sending responses to discovery pings, we unconditionally
reply. Instead, this commit modifies the response handler to not reply
when the cluster names do not match.

This addresses a race condition identified after reducing the timeout in
UnicastZenPingTests#testSimplePings. In particular, we send pings in the
following way:
 - if not connected to the node, connect to the node and after
   successful handshake, send a ping
 - if connected to the node, send a ping

When the ping timeout is set low, a subsequent batch of pings can race
against a connect/disconnect cycle from a prior batch of pings. In
particular, consider the following scenario:
 - node A from cluster X
 - node B from cluster Y
 - pings are initiated from node A with node B in the hosts list
 - node A will try to connect and handshake with B
 - the connection will succeed, and the handshake will eventually fail due to mismatched cluster names
 - on a short timeout, a second batch of pings will fire, and on this
   batch node A will see that it is still connected to node B; thus, it
   will immediately fire a ping to node B and node B will dutifully
   respond
@jasontedor jasontedor force-pushed the unicast-zen-ping-race branch from 8ecd370 to 9738584 Compare November 30, 2016 19:23
@jasontedor jasontedor merged commit c90ba67 into elastic:master Nov 30, 2016
jasontedor added a commit that referenced this pull request Nov 30, 2016
Today when sending responses to discovery pings, we unconditionally
reply. Instead, this commit modifies the response handler to not reply
when the cluster names do not match.

This addresses a race condition identified after reducing the timeout in
UnicastZenPingTests#testSimplePings. In particular, we send pings in the
following way:
 - if not connected to the node, connect to the node and after
   successful handshake, send a ping
 - if connected to the node, send a ping

When the ping timeout is set low, a subsequent batch of pings can race
against a connect/disconnect cycle from a prior batch of pings. In
particular, consider the following scenario:
 - node A from cluster X
 - node B from cluster Y
 - pings are initiated from node A with node B in the hosts list
 - node A will try to connect and handshake with B
 - the connection will succeed, and the handshake will eventually fail due to mismatched cluster names
 - on a short timeout, a second batch of pings will fire, and on this
   batch node A will see that it is still connected to node B; thus, it
   will immediately fire a ping to node B and node B will dutifully
   respond

Relates #21894
if (node.equals(localNode)) {
return localNode;
}
logger.trace("connecting with node [{}] to perform handshake", node);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need these traces?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 92f05e7.

channel.sendResponse(handlePingRequest(request));
if (request.pingResponse.clusterName().equals(clusterName)) {
channel.sendResponse(handlePingRequest(request));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we throw an exception instead? The channel won't be closed otherwise?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch @s1monw; I will push directly.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 761325b.

@s1monw
Copy link
Copy Markdown
Contributor

s1monw commented Nov 30, 2016

@jasontedor I added a bunch of comments I don't think we should not reply

jasontedor added a commit that referenced this pull request Nov 30, 2016
Today when sending responses to discovery pings, we unconditionally
reply. Instead, this commit modifies the response handler to not reply
when the cluster names do not match.

This addresses a race condition identified after reducing the timeout in
UnicastZenPingTests#testSimplePings. In particular, we send pings in the
following way:
 - if not connected to the node, connect to the node and after
   successful handshake, send a ping
 - if connected to the node, send a ping

When the ping timeout is set low, a subsequent batch of pings can race
against a connect/disconnect cycle from a prior batch of pings. In
particular, consider the following scenario:
 - node A from cluster X
 - node B from cluster Y
 - pings are initiated from node A with node B in the hosts list
 - node A will try to connect and handshake with B
 - the connection will succeed, and the handshake will eventually fail due to mismatched cluster names
 - on a short timeout, a second batch of pings will fire, and on this
   batch node A will see that it is still connected to node B; thus, it
   will immediately fire a ping to node B and node B will dutifully
   respond

Relates #21894
@jasontedor jasontedor deleted the unicast-zen-ping-race branch November 30, 2016 21:14
@jasontedor
Copy link
Copy Markdown
Member Author

Thanks @abeyad and @s1monw.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug :Distributed/Discovery-Plugins Anything related to our integration plugins with EC2, GCP and Azure v5.2.0 v6.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants