We have a serialization bug somewhere in the stats serialization code. I've now seen six independent reports (2, 4, 5 and three more that are not linkable) of:
[2016-12-12T09:26:50,081][WARN ][o.e.t.n.Netty4Transport ] [...] exception caught on transport layer [[id: 0xcbdaf621, L:/...:35678 - R:.../...:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [...], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1@44aa70c], error [false]; resetting
at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1257) ~[elasticsearch-5.1.1.jar:5.1.1]
and the related
Caused by: java.io.EOFException: tried to read: 91755306 bytes but only 114054 remaining
and
Caused by: java.lang.IllegalStateException: No routing state mapped for [103]
at org.elasticsearch.cluster.routing.ShardRoutingState.fromValue(ShardRoutingState.java:71) ~[elasticsearch-5.1.1.jar:5.1.1]
It seems to always be in some stats response, either a node stats response or a cluster stats response, and it's coming from TransportBroadcastByNodeAction and the single action defined by a lambda in TransportNodesAction$AsyncAction. We are misreading the stream somewhere and then reading garbage subsequently.
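To illustrate the failure mode (this is a minimal sketch with made-up fields, not the actual stats code): if a writer and reader disagree about even one field, every later read happens at the wrong offset, so a subsequent value such as a routing-state ordinal comes back as garbage, and unread bytes remain at the end of the message, which is exactly the "Message not fully read" symptom.

```java
import java.io.*;

public class StreamMismatch {
    // Hypothetical round trip: the writer serializes three fields, but the
    // reader skips field B, mimicking a writeTo/readFrom mismatch.
    static int[] roundTrip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(42);   // field A
        out.writeLong(7L);  // field B -- the reader below forgets to read it
        out.writeInt(103);  // field C, e.g. a routing-state ordinal

        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
        int a = in.readInt();          // correct: 42
        // BUG: the readLong() for field B is missing, so this read consumes
        // the high bytes of B and interprets garbage as field C.
        int c = in.readInt();
        int leftover = in.available(); // unread bytes => "message not fully read"
        return new int[] {a, c, leftover};
    }

    public static void main(String[] args) throws IOException {
        int[] r = roundTrip();
        System.out.println("a=" + r[0] + " c=" + r[1] + " leftover=" + r[2]);
        // prints "a=42 c=0 leftover=8" -- field C is corrupted and bytes remain
    }
}
```

The hard part in our case is that the mismatch could be in any of the many nested objects these responses serialize, and everything downstream of the first bad read is plausible-looking garbage.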
Whatever it is, it's pesky. So far, there is not a reliable reproduction and finding the bug is tricky since these responses serialize the entire world.
The first instance of this led to #21478 so that we know the handler name, #22152 so we can detect corruption earlier, and #22223 to clean up some serialization code. Right now, I do not think we've squashed the issue.