Check for closed connection while opening#26932
Conversation
While opening a connection to a node, a channel can subsequently close. If this happens, a future callback whose purpose is to close all other channels and disconnect from the node will fire. However, this future will not be ready to close all the channels because the connection will not be exposed to the future callback yet. Since this callback is run once, we will never try to disconnect from this node again and we will be left with a closed channel. This commit adds a check that all channels are open before exposing the channel and throws a general connection exception. In this case, the usual connection retry logic will take over.
* master: update Lucene version for 6.0-RC2 version Calculate and cache result when advanceExact is called (elastic#26920) Test query builder bwc against previous supported versions instead of just the current version. Set minimum_master_nodes on rolling-upgrade test (elastic#26911) Return List instead of an array from settings (elastic#26903) remove _primary and _replica shard preferences (elastic#26791) fixing typo in datehistogram-aggregation.asciidoc (elastic#26924) [API] Added the `terminate_after` parameter to the REST spec for "Count" API Setup debug logging for qa.full-cluster-restart Enable BWC testing against other remotes
bleskes
left a comment
There was a problem hiding this comment.
Thx @jasontedor . I left one important comment.
| @Override | ||
| protected void close(NioChannel nioChannel) { | ||
| try { | ||
| nioChannel.getCloseFuture().awaitClose(); |
There was a problem hiding this comment.
Out of curiosity - this reads weird - we wait on close but not actively close? is that OK?
| }; | ||
| nodeChannels = connectToChannels(node, connectionProfile, onClose); | ||
| nodeChannels = connectToChannels(node, connectionProfile, this::onChannelOpen, onClose); | ||
| if (!Arrays.stream(nodeChannels.channels).allMatch(this::isOpen)) { |
There was a problem hiding this comment.
I this this to be moved to after connectionRef.set(nodeChannels); is called. Otherwise we still have a race condition when things go wrong later on?
|
@bleskes This is ready for you again. |
| nodeChannels = new NodeChannels(nodeChannels, version); // clone the channels - we now have the correct version | ||
| transportService.onConnectionOpened(nodeChannels); | ||
| connectionRef.set(nodeChannels); | ||
| if (!Arrays.stream(nodeChannels.channels).allMatch(this::isOpen)) { |
| * | ||
| * @param channel the opened channel | ||
| */ | ||
| protected void onChannelOpen(final Channel channel) { |
There was a problem hiding this comment.
I am not sure I understand this change. Why do we have to have this onChannelOpen callback? Is it for testing only? I wonder if we can find a different way to do this, it's a no-op in production
There was a problem hiding this comment.
Yes, it's for testing so we can hook in and quietly close a channel after it's been successfully opened to trigger the issue. I'm open to suggestions on an alternative approach to achieve the same, but this is the cleanest that I found.
There was a problem hiding this comment.
Yesterday I had tried to use the connection listener to hook in and disconnect; the problem with this is that it was not reliable because it happens on a background thread and would only sporadically succeed to close a channel in time. @s1monw had the clever and sneaky idea to hook into the executor so we can replace it with one that happens synchronously on the same thread and thus we are ensured it happens before the connection finishes. I have pushed this change.
Tim-Brooks
left a comment
There was a problem hiding this comment.
So generally LGTM be I did have one concern about if we had address all the potential issues. Also our method contracts are getting kind of complicated and we depend on both nio and netty properly implementing similar connection failure logic.
I'm not sure if we should maybe move to a different contact here at some point. Such as:
protected abstract List<Future<Channel>> connectToChannels(DiscoveryNode node,
ConnectionProfile connectionProfile,
Consumer<Channel> onChannelClose) throws IOException;
In this scenarios the list of futures returned are the connection futures.
And all the implementation does is initiate the connection process and attach the close listener. It is up to TcpTransport to block on a connection and do all the validation ensuring that the connections are setup properly. I'm not sure I've thought through all the implications of that. But it is something to consider in the future.
| final AtomicBoolean first = new AtomicBoolean(); | ||
| service = | ||
| buildService("service", version0, clusterSettings, Settings.EMPTY, true, true, channel -> { | ||
| if (!first.compareAndSet(false, true)) { |
There was a problem hiding this comment.
I understand that this works. But the booleans (as labelled) seem backwards to me.
Isn't it:
final AtomicBoolean first = new AtomicBoolean(true)
...
if (!first.compareAndSet(true, false)) {
}
?
| nodeChannels = new NodeChannels(nodeChannels, version); // clone the channels - we now have the correct version | ||
| transportService.onConnectionOpened(nodeChannels); | ||
| connectionRef.set(nodeChannels); | ||
| if (Arrays.stream(nodeChannels.channels).allMatch(this::isOpen) == false) { |
There was a problem hiding this comment.
I guess this solution solves the majority of the cases where the original issue arises but don't we still have an issue if this check occurs and immediately after it succeeds, concurrently the other side disconnects? The event loop is a different thread so the close listener could still be executed prior the nodes channels being returned from this scope.
In order to be safe don't we need to check if all the channels are still open AFTER we put it in the connectedNodes map in the connectToNode method?
There was a problem hiding this comment.
I think it's enough to check they are open after we've set the reference to the channels because the problem arises if we close before that reference is set; after that reference is set we are covered by the close listener?
There was a problem hiding this comment.
Oh yeah that’s probably right. I was thinking about the connection map that is not manipulated until later.
|
@tbrooks8 I think your suggestion for returning the futures and lifting the channel connection handling logic into |
|
@s1monw This is ready for you. |
* master: Fix handling of paths containing parentheses Allow only a fixed-size receive predictor (elastic#26165) Add Homebrew instructions to getting started ingest: Fix bug that prevent date_index_name processor from accepting timestamps specified as a json number Scripting: Fix expressions to temporarily support filter scripts (elastic#26824) Docs: Add note to contributing docs warning against tool based refactoring (elastic#26936) Fix thread context handling of headers overriding (elastic#26068) SearchWhileCreatingIndexIT: remove usage of _only_nodes
|
I discussed with @bleskes via another channel and he is okay with this being merged. |
While opening a connection to a node, a channel can subsequently close. If this happens, a future callback whose purpose is to close all other channels and disconnect from the node will fire. However, this future will not be ready to close all the channels because the connection will not be exposed to the future callback yet. Since this callback is run once, we will never try to disconnect from this node again and we will be left with a closed channel. This commit adds a check that all channels are open before exposing the channel and throws a general connection exception. In this case, the usual connection retry logic will take over. Relates #26932
While opening a connection to a node, a channel can subsequently close. If this happens, a future callback whose purpose is to close all other channels and disconnect from the node will fire. However, this future will not be ready to close all the channels because the connection will not be exposed to the future callback yet. Since this callback is run once, we will never try to disconnect from this node again and we will be left with a closed channel. This commit adds a check that all channels are open before exposing the channel and throws a general connection exception. In this case, the usual connection retry logic will take over. Relates #26932
While opening a connection to a node, a channel can subsequently close. If this happens, a future callback whose purpose is to close all other channels and disconnect from the node will fire. However, this future will not be ready to close all the channels because the connection will not be exposed to the future callback yet. Since this callback is run once, we will never try to disconnect from this node again and we will be left with a closed channel. This commit adds a check that all channels are open before exposing the channel and throws a general connection exception. In this case, the usual connection retry logic will take over. Relates #26932
While opening a connection to a node, a channel can subsequently close. If this happens, a future callback whose purpose is to close all other channels and disconnect from the node will fire. However, this future will not be ready to close all the channels because the connection will not be exposed to the future callback yet. Since this callback is run once, we will never try to disconnect from this node again and we will be left with a closed channel. This commit adds a check that all channels are open before exposing the channel and throws a general connection exception. In this case, the usual connection retry logic will take over.