TransportService.connectToNode should validate remote node ID #22828
bleskes merged 37 commits into elastic:master from
Conversation
test this please
s1monw left a comment
left some suggestions, no blockers, but I'd love them to be addressed.... no need for another review
/**
 * we try to reuse existing connections but if needed we will open a temporary connection
 * that nodes to be closed at the end of execution.
s/that nodes to be closed at the end of execution./to a node that will be closed at the end of the execution/?
logger.info(
    (Supplier<?>) () -> new ParameterizedMessage(
        "failed to get local cluster state for {}, disconnecting...", nodeToPing), e);
latch.countDown();
I am not sure, but should we count down the latch only once we have notified the listener? I think it would be the better semantics.... maybe in a finally block?
yeah, I tried to avoid the extra bloat of a finally block, but I agree.
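A minimal sketch of the shape being suggested (hypothetical names, not the actual MasterFaultDetection code): notify the listener first, and count down the latch in a finally block so waiters are released even if the listener throws.

```java
import java.util.concurrent.CountDownLatch;
import java.util.function.Consumer;

public class LatchSketch {
    // Notify the failure listener first, then count down the latch in a
    // finally block, so the latch is released even if the listener throws
    // and waiters never observe the latch before the listener has run.
    static void notifyAndCountDown(Exception e, Consumer<Exception> listener, CountDownLatch latch) {
        try {
            listener.accept(e);
        } finally {
            latch.countDown();
        }
    }
}
```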
private final PublishClusterStateAction publishClusterState;
private final MembershipAction membership;
private final ThreadPool threadPool;
} catch (Exception e) {
    logger.warn((Supplier<?>) () -> new ParameterizedMessage("failed to send rejoin request to [{}]", otherMaster), e);
}
// spawn to a background thread to not do blocking operations on
// spawn to a background thread to not do blocking operations on .. on? on what?
spawn to a background thread to not do blocking operations on the cluster state thread
thx.
/** create an empty builder */
public Builder() {
    throw new ConnectTransportException(node, "general node connection failure", e);
} finally {
    if (success == false) {
        connectedNodes.remove(node);
I think you can move this to the inner catch; then you don't need to remove it from connectedNodes....
yeah, that's a good one. I made it so that exceptions from transportServiceAdapter.onNodeConnected(node) will mean an unsuccessful connection, as will any exception that bubbles out of this method.
I think that's pretty much illegal; it should not throw anything, and it should be handled internally.
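The pattern under discussion, sketched with hypothetical names: the node is registered optimistically, and a success flag plus a finally block roll it back if anything (including the post-connect callback) throws, so a failed connection never leaves a stale entry in connectedNodes.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ConnectSketch {
    private final Set<String> connectedNodes = ConcurrentHashMap.newKeySet();

    // Registers the node, then runs the post-connect callback; if the callback
    // (or anything else in the try block) throws, the finally block removes the
    // node again, so an unsuccessful connection leaves no stale entry behind.
    void connectToNode(String node, Runnable onNodeConnected) {
        boolean success = false;
        try {
            connectedNodes.add(node);
            onNodeConnected.run(); // may throw; treated as a failed connection
            success = true;
        } finally {
            if (success == false) {
                connectedNodes.remove(node);
            }
        }
    }

    boolean isConnected(String node) {
        return connectedNodes.contains(node);
    }
}
```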
import org.elasticsearch.test.ESTestCase;

public class TransportServiceTests extends ESTestCase {
hmm what is this? :D you just wanna add some lines of code I guess 💃
hehe. I started to add a test and then noticed that AbstractSimpleTransportTestCase has all the infra. Will clean this up.
 */
void connectToNode(DiscoveryNode node, ConnectionProfile connectionProfile) throws ConnectTransportException;
void connectToNode(DiscoveryNode node, ConnectionProfile connectionProfile,
                   CheckedBiConsumer<Connection, ConnectionProfile, IOException> connectionValidator) throws ConnectTransportException;
this must be non-null correct?
yeah, I think this deep in the infra it's better to have clarity. It's easy enough to pass a no-op from tests, and in production code we always pass something.
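A hedged sketch of what a required (non-null) validator might look like; the CheckedBiConsumer here is a simplified stand-in for Elasticsearch's interface, and the names are hypothetical. Tests can pass a no-op lambda, while production code passes a real handshake check.

```java
import java.io.IOException;

public class ValidatorSketch {
    // Simplified stand-in for Elasticsearch's CheckedBiConsumer: a two-argument
    // consumer that may throw a checked exception.
    @FunctionalInterface
    interface CheckedBiConsumer<T, U, E extends Exception> {
        void accept(T t, U u) throws E;
    }

    // A no-op validator, as tests might pass when no handshake check is needed.
    static final CheckedBiConsumer<Object, Object, IOException> NOOP = (connection, profile) -> {};

    // Sketch of the connect path: the validator is required to be non-null and
    // is invoked on the freshly opened connection before it is registered.
    static void connectToNode(Object connection, Object profile,
                              CheckedBiConsumer<Object, Object, IOException> validator) throws IOException {
        if (validator == null) {
            throw new NullPointerException("connectionValidator must be non-null");
        }
        validator.accept(connection, profile); // e.g. handshake + node id check
    }
}
```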
…en processed and the NodeConnectionService connect to the new master. This removes the need to explicitly connect to the master, which triggers an assertion due to the blocking operation on the cluster state thread. Relates to elastic#22828
…en processed and the NodeConnectionService connect to the new master (#23037) After the first cluster state from a new master is processed, NodeConnectionService guarantees we connect to the new master. This removes the need to explicitly connect to the master in the MasterFaultDetection code making it simpler and bypasses the assertion triggered due to the blocking operation on the cluster state thread. Relates to #22828
When a node receives a new cluster state from the master, it opens up connections to any new node in the cluster state. That has always been done serially on the cluster state thread, but it has been a long-standing TODO to do this concurrently, which is done by this PR. This is a spin-off of #22828, where an extra handshake is done whenever connecting to a node, which may slow down connecting. Also, the handshake is done in a blocking fashion, which triggers assertions w.r.t. blocking requests on the cluster state thread. Instead of adding an exception, I opted to implement concurrent connections, which both side-steps the assertion and compensates for the extra handshake.
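The concurrent-connection idea can be sketched roughly as follows (hypothetical names, not the actual NodeConnectionsService code): fan the connect attempts out to an executor and use a latch so the caller can wait for all of them to finish instead of connecting serially on the cluster state thread.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ConcurrentConnectSketch {
    // Connect to all newly discovered nodes in parallel instead of serially;
    // the latch lets the caller block until every attempt has completed.
    static Set<String> connectAll(List<String> nodes, ExecutorService executor) throws InterruptedException {
        Set<String> connected = ConcurrentHashMap.newKeySet();
        CountDownLatch latch = new CountDownLatch(nodes.size());
        for (String node : nodes) {
            executor.execute(() -> {
                try {
                    connected.add(node); // stand-in for the (slow) connect + handshake
                } finally {
                    latch.countDown(); // always count down, even on failure
                }
            });
        }
        latch.await(); // wait for all connection attempts to finish
        return connected;
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        Set<String> result = connectAll(List.of("n1", "n2", "n3"), pool);
        System.out.println(result.size()); // 3
        pool.shutdown();
    }
}
```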
#22194 gave us the ability to open low-level temporary connections to remote nodes based on their address. With this use case out of the way, actual full-blown connections should validate the node on the other side, making sure we speak to whom we think we speak. This helps in cases where multiple nodes are started on the same host and a quick node restart causes them to swap addresses, which in turn can cause confusion down the road.
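The validation itself reduces to a simple check, sketched here with hypothetical names: after connecting to an address, compare the node ID the remote reports during the handshake against the ID we expected for that address, and fail the connection on a mismatch.

```java
public class HandshakeSketch {
    // After connecting to an address, the remote node reports its id in the
    // handshake; compare it to the id we expected for that address. A mismatch
    // means a different node (e.g. after a quick restart on the same host)
    // now owns the address, so the connection must be rejected.
    static void validateRemoteNodeId(String expectedId, String handshakeId, String address) {
        if (expectedId.equals(handshakeId) == false) {
            throw new IllegalStateException("expected node [" + expectedId
                + "] at " + address + " but handshake returned [" + handshakeId + "]");
        }
    }
}
```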
…TestCase & Netty3ScheduledPingTests broken by backport of #22828