H2/H2C server stream channels deactivated while write still in progress#13388
Conversation
|
it seems that the 5 cancelled checks are caused by the failure of linux-x86_64-java11-graal; so I have downloaded the logs from linux-x86_64-java11-graal check, but it seems that the failure comes from Netty/Handler, which is not related to my modifications: So, I don't know what to think for the moment; and I will need to investigate this before you can proceed with the review. |
|
@normanmaurer, could you manually restart only the 5 cancelled checks , in order to see how they run ? |
|
@pderop thanks a lot for the investigation and fix :) |
|
thanks Norman for the review and for the merge :-) I will also check if we have the same issue in Netty 5 (maybe this week if I can, or the next week). |
|
@pderop I think the answer is yes... I am working on a port of it. |
|
ok, thanks ! |
…ss (#13388) Motivation: In the context of Reactor-Netty, we experience an issue during HTTP2 (or H2C) load tests. We have an HTTP2 (or H2C, it does not matter) server which is using in the pipeline an Http2FrameCodec, followed by a Http2MultiplexHandler, and when the client sends a new request, then the Http2MultiplexHandler will create a child `Http2MultiplexHandlerStreamChannel` that will contain in its pipeline an Http2StreamFrameToHttpObjectCodec followed by some other specific reactor-netty handlers. Now, the issue is the following: when the client sends the last request frame (with EOS flag=true), and when the server sends the last response frame (also with EOS flag = true), then in the server the stream will be closed, and the stream channel handlers will be called in `channelInactive` (that is ok). But sometimes, under high load, when the last server response can't be flushed when the parent channel is non-writtable, then we see that the server stream channel handlers may be called in`channelInactive` while the server last response frame is still writing and is not yet flushed. In other words, the `ChannelFuture` of the last server response sent to the client is not "success", but is "incomplete" at the time the child stream channel handlers are called in `channelInactive`. Normally, if I'm correct, when a channel handler is called in `channelInactive`, it means that it is now inactive and has reached its end of lifetime. So when our handlers are called in `channelInactive` while the response is not yet flushed (because parent channel was non-writable), then then we are getting into troubles, because we are then trying to cleanup resources, like the buffer of the last server response, but a bit later, when the last response is now flushed, then the buffer will be freed again, and we end up with many IllegalReferenceCountExceptions. So, we think that `channelInactive` should only be invoked after the last response is fully written and released. To reproduce the issue and investigate it with the debugger, you can first run the `Http2MultiplexTransportTest.streamHandlerInactivatedResponseFlushed` from this PR, but without applying the patch. Only pay attention to the "serverloop" thread. I tend to think that the problem may start from `DefaultHttp2RemoteFlowController.FlowState.writeAllocatedBytes(int allocated)`, line 368, where `frame.writeComplete()` is called: this method will indirectly trigger `AbstractHttp2StreamChannel.fireChannelInactiveAndDeregister(), line 742`, but without waiting for the frame promise to complete (the promise is in `DefaultHttp2ConnectionEncoder.FlowControlledBase`). Modification: Added a reproducer test in`Http2MultiplexTransportTest.streamHandlerInactivatedResponseFlushed`test. This test simulates non-writable socket by configuring the SO_SNDBUF of the server side connection to `1`. This will trigger the issue, because when the server will respond, it will get many writability events and the response won't be flushed immediately. I have tried to apply the following simple patch which seems to resolve the problem: In. `Http2ConnectionHandler.closeStream` method, ensures that the stream is closed once the future is completed: instead of doing: ``` @OverRide public void closeStream(final Http2Stream stream, ChannelFuture future) { stream.close(); if (future.isDone()) { checkCloseConnection(future); } else { future.addListener(new ChannelFutureListener() { @OverRide public void operationComplete(ChannelFuture future) throws Exception { checkCloseConnection(future); } }); } } ``` the closing is then done in the future listener, like this: ``` Override public void closeStream(final Http2Stream stream, ChannelFuture future) { if (future.isDone()) { doCloseStream(stream, future); } else { future.addListener(f -> doCloseStream(stream, future)); } } private void doCloseStream(final Http2Stream stream, ChannelFuture future) { stream.close(); checkCloseConnection(future); } ``` This seems to resolve the issue, because the stream will be closed only once the last response frame has been fully flushed. The Http2ConnectionHandlerTest.canCloseStreamWithVoidPromise has also been modified in order to set the promise to success before doing the test: ``` @test public void canCloseStreamWithVoidPromise() throws Exception { handler = newHandler(); handler.closeStream(stream, ctx.voidPromise().setSuccess()); verify(stream, times(1)).close(); verifyNoMoreInteractions(stream); } ``` Result: Fixes reactor/reactor-netty#2760
|
done... |
|
+1, the change makes sense, great find @pderop! |
### What changes were proposed in this pull request? This pr aims to upgrade netty from 4.1.92 to 4.1.93. ### Why are the changes needed? 1.v4.1.92 VS v4.1.93 netty/netty@netty-4.1.92.Final...netty-4.1.93.Final 2.The new version brings some bug fix, eg: - Reset byte buffer in loop for AbstractDiskHttpData.setContent ([#13320](netty/netty#13320)) - OpenSSL MAX_CERTIFICATE_LIST_BYTES option supported ([#13365](netty/netty#13365)) - Adapt to DirectByteBuffer constructor in Java 21 ([#13366](netty/netty#13366)) - HTTP/2 encoder: allow HEADER_TABLE_SIZE greater than Integer.MAX_VALUE ([#13368](netty/netty#13368)) - Upgrade to latest netty-tcnative to fix memory leak ([#13375](netty/netty#13375)) - H2/H2C server stream channels deactivated while write still in progress ([#13388](netty/netty#13388)) - Channel#bytesBefore(un)writable off by 1 ([#13389](netty/netty#13389)) - HTTP/2 should forward shutdown user events to active streams ([#13394](netty/netty#13394)) - Respect the number of bytes read per datagram when using recvmmsg ([#13399](netty/netty#13399)) 3.The release notes as follows: - https://netty.io/news/2023/05/25/4-1-93-Final.html 4.Why not upgrade to `4-1-94-Final` version? Because the return value of the 'threadCache()' (from `PoolThreadCache` to `PoolArenasCache`) method of the netty Inner class used in the 'arrow memory netty' version '12.0.1' has changed and belongs to break change, let's wait for the upgrade of the 'arrow memory netty' before upgrading to the '4-1-94-Final' version. The reference is as follows: https://github.com/apache/arrow/blob/6af660f48472b8b45a5e01b7136b9b040b185eb1/java/memory/memory-netty/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java#L164 https://github.com/netty/netty/blob/da1a448d5bc4f36cc1744db93fcaf64e198db2bd/buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java#L732-L736 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GA. Closes #41681 from panbingkun/upgrade_netty. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Motivation:
In the context of Reactor-Netty, we experience an issue during HTTP2 (or H2C) load tests.
We have an HTTP2 (or H2C, it does not matter) server which is using in the pipeline an Http2FrameCodec, followed by a Http2MultiplexHandler, and when the client sends a new request, then the Http2MultiplexHandler will create a child
Http2MultiplexHandlerStreamChannelthat will contain in its pipeline an Http2StreamFrameToHttpObjectCodec followed by some other specific reactor-netty handlers.Now, the issue is the following: when the client sends the last request frame (with EOS flag=true), and when the server sends the last response frame (also with EOS flag = true), then in the server the stream will be closed, and the stream channel handlers will be called in
channelInactive(that is ok).But sometimes, under high load, when the last server response can't be flushed when the parent channel is non-writtable, then we see that the server stream channel handlers may be called in
channelInactivewhile the server last response frame is still writing and is not yet flushed.In other words, the
ChannelFutureof the last server response sent to the client is not "success", but is "incomplete" at the time the child stream channel handlers are called inchannelInactive.Normally, if I'm correct, when a channel handler is called in
channelInactive, it means that it is now inactive and has reached its end of lifetime. So when our handlers are called inchannelInactivewhile the response is not yet flushed (because parent channel was non-writable), then then we are getting into troubles, because we are then trying to cleanup resources, like the buffer of the last server response, but a bit later, when the last response is now flushed, then the buffer will be freed again, and we end up with many IllegalReferenceCountExceptions. So, we think thatchannelInactiveshould only be invoked after the last response is fully written and released.To reproduce the issue and investigate it with the debugger, you can first run the
Http2MultiplexTransportTest.streamHandlerInactivatedResponseFlushedfrom this PR, but without applying the patch. Only pay attention to the "serverloop" thread.I tend to think that the problem may start from
DefaultHttp2RemoteFlowController.FlowState.writeAllocatedBytes(int allocated), line 368, whereframe.writeComplete()is called: this method will indirectly triggerAbstractHttp2StreamChannel.fireChannelInactiveAndDeregister(), line 742, but without waiting for the frame promise to complete (the promise is inDefaultHttp2ConnectionEncoder.FlowControlledBase).Modification:
Added a reproducer test in
Http2MultiplexTransportTest.streamHandlerInactivatedResponseFlushedtest. This test simulates non-writable socket by configuring the SO_SNDBUF of the server side connection to1. This will trigger the issue, because when the server will respond, it will get many writability events and the response won't be flushed immediately.I have tried to apply the following simple patch which seems to resolve the problem: In.
Http2ConnectionHandler.closeStreammethod, ensures that the stream is closed once the future is completed:instead of doing:
the closing is then done in the future listener, like this:
This seems to resolve the issue, because the stream will be closed only once the last response frame has been fully flushed.
The Http2ConnectionHandlerTest.canCloseStreamWithVoidPromise has also been modified in order to set the promise to success before doing the test:
Result:
Fixes reactor/reactor-netty#2760
PS: if the validity of this issue is accepted, but the PR is not satisfactory, I'll be more than happy to attempt a new patch, please let me know.