Skip to content

H2/H2C server stream channels deactivated while write still in progress#13388

Merged
normanmaurer merged 4 commits into
netty:4.1from
pderop:4.1.x-h2stream-inactivate-fired-while-write-still-in-progress
May 22, 2023
Merged

H2/H2C server stream channels deactivated while write still in progress#13388
normanmaurer merged 4 commits into
netty:4.1from
pderop:4.1.x-h2stream-inactivate-fired-while-write-still-in-progress

Conversation

@pderop

@pderop pderop commented May 19, 2023

Copy link
Copy Markdown
Contributor

Motivation:

In the context of Reactor-Netty, we experience an issue during HTTP2 (or H2C) load tests.
We have an HTTP2 (or H2C, it does not matter) server which is using in the pipeline an Http2FrameCodec, followed by a Http2MultiplexHandler, and when the client sends a new request, then the Http2MultiplexHandler will create a child Http2MultiplexHandlerStreamChannel that will contain in its pipeline an Http2StreamFrameToHttpObjectCodec followed by some other specific reactor-netty handlers.

Now, the issue is the following: when the client sends the last request frame (with EOS flag=true), and when the server sends the last response frame (also with EOS flag = true), then in the server the stream will be closed, and the stream channel handlers will be called in channelInactive (that is ok).
But sometimes, under high load, when the last server response can't be flushed when the parent channel is non-writtable, then we see that the server stream channel handlers may be called inchannelInactive while the server last response frame is still writing and is not yet flushed.
In other words, the ChannelFuture of the last server response sent to the client is not "success", but is "incomplete" at the time the child stream channel handlers are called in channelInactive.

Normally, if I'm correct, when a channel handler is called in channelInactive, it means that it is now inactive and has reached its end of lifetime. So when our handlers are called in channelInactive while the response is not yet flushed (because parent channel was non-writable), then then we are getting into troubles, because we are then trying to cleanup resources, like the buffer of the last server response, but a bit later, when the last response is now flushed, then the buffer will be freed again, and we end up with many IllegalReferenceCountExceptions. So, we think that channelInactive should only be invoked after the last response is fully written and released.

To reproduce the issue and investigate it with the debugger, you can first run the Http2MultiplexTransportTest.streamHandlerInactivatedResponseFlushed from this PR, but without applying the patch. Only pay attention to the "serverloop" thread.

I tend to think that the problem may start from DefaultHttp2RemoteFlowController.FlowState.writeAllocatedBytes(int allocated), line 368, where frame.writeComplete() is called: this method will indirectly trigger AbstractHttp2StreamChannel.fireChannelInactiveAndDeregister(), line 742, but without waiting for the frame promise to complete (the promise is in DefaultHttp2ConnectionEncoder.FlowControlledBase).

Modification:

Added a reproducer test inHttp2MultiplexTransportTest.streamHandlerInactivatedResponseFlushedtest. This test simulates non-writable socket by configuring the SO_SNDBUF of the server side connection to 1. This will trigger the issue, because when the server will respond, it will get many writability events and the response won't be flushed immediately.

I have tried to apply the following simple patch which seems to resolve the problem: In. Http2ConnectionHandler.closeStream method, ensures that the stream is closed once the future is completed:

instead of doing:

    @Override
    public void closeStream(final Http2Stream stream, ChannelFuture future) {
        stream.close();

        if (future.isDone()) {
            checkCloseConnection(future);
        } else {
            future.addListener(new ChannelFutureListener() {
                @Override
                public void operationComplete(ChannelFuture future) throws Exception {
                    checkCloseConnection(future);
                }
            });
        }
    }

the closing is then done in the future listener, like this:

Override
    public void closeStream(final Http2Stream stream, ChannelFuture future) {
        if (future.isDone()) {
            doCloseStream(stream, future);
        } else {
            future.addListener(f -> doCloseStream(stream, future));
        }
    }
    
    private void doCloseStream(final Http2Stream stream, ChannelFuture future) {
        stream.close();
        checkCloseConnection(future);
    }

This seems to resolve the issue, because the stream will be closed only once the last response frame has been fully flushed.
The Http2ConnectionHandlerTest.canCloseStreamWithVoidPromise has also been modified in order to set the promise to success before doing the test:

    @Test
    public void canCloseStreamWithVoidPromise() throws Exception {
        handler = newHandler();
        handler.closeStream(stream, ctx.voidPromise().setSuccess());
        verify(stream, times(1)).close();
        verifyNoMoreInteractions(stream);
    }

Result:

Fixes reactor/reactor-netty#2760

PS: if the validity of this issue is accepted, but the PR is not satisfactory, I'll be more than happy to attempt a new patch, please let me know.

@pderop

pderop commented May 19, 2023

Copy link
Copy Markdown
Contributor Author

it seems that the 5 cancelled checks are caused by the failure of linux-x86_64-java11-graal; so I have downloaded the logs from linux-x86_64-java11-graal check, but it seems that the failure comes from Netty/Handler, which is not related to my modifications:

2023-05-19T19:55:52.2571453Z [ERROR] Tests run: 18, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.815 s <<< FAILURE! - in io.netty.handler.ssl.CipherSuiteCanaryTest
2023-05-19T19:55:52.2579076Z [ERROR] testHandshake{SslProvider, SslProvider, String, boolean}[6]  Time elapsed: 0.051 s  <<< ERROR!
2023-05-19T19:55:52.2579747Z java.util.concurrent.CancellationException
2023-05-19T19:55:52.2580388Z 	at io.netty.util.concurrent.DefaultPromise.cancel(...)(Unknown Source)
2023-05-19T19:55:52.2580763Z 

So, I don't know what to think for the moment; and I will need to investigate this before you can proceed with the review.

@pderop

pderop commented May 19, 2023

Copy link
Copy Markdown
Contributor Author

@normanmaurer, could you manually restart only the 5 cancelled checks , in order to see how they run ?
thank you.

@normanmaurer normanmaurer merged commit f76d646 into netty:4.1 May 22, 2023
@normanmaurer

Copy link
Copy Markdown
Member

@pderop thanks a lot for the investigation and fix :)

@pderop

pderop commented May 22, 2023

Copy link
Copy Markdown
Contributor Author

thanks Norman for the review and for the merge :-)

I will also check if we have the same issue in Netty 5 (maybe this week if I can, or the next week).

@normanmaurer

Copy link
Copy Markdown
Member

@pderop I think the answer is yes... I am working on a port of it.

@pderop

pderop commented May 22, 2023

Copy link
Copy Markdown
Contributor Author

ok, thanks !

normanmaurer pushed a commit that referenced this pull request May 22, 2023
…ss (#13388)

Motivation:

In the context of Reactor-Netty, we experience an issue during HTTP2 (or
H2C) load tests.
We have an HTTP2 (or H2C, it does not matter) server which is using in
the pipeline an Http2FrameCodec, followed by a Http2MultiplexHandler,
and when the client sends a new request, then the Http2MultiplexHandler
will create a child `Http2MultiplexHandlerStreamChannel` that will
contain in its pipeline an Http2StreamFrameToHttpObjectCodec followed by
some other specific reactor-netty handlers.

Now, the issue is the following: when the client sends the last request
frame (with EOS flag=true), and when the server sends the last response
frame (also with EOS flag = true), then in the server the stream will be
closed, and the stream channel handlers will be called in
`channelInactive` (that is ok).
But sometimes, under high load, when the last server response can't be
flushed when the parent channel is non-writtable, then we see that the
server stream channel handlers may be called in`channelInactive` while
the server last response frame is still writing and is not yet flushed.
In other words, the `ChannelFuture` of the last server response sent to
the client is not "success", but is "incomplete" at the time the child
stream channel handlers are called in `channelInactive`.

Normally, if I'm correct, when a channel handler is called in
`channelInactive`, it means that it is now inactive and has reached its
end of lifetime. So when our handlers are called in `channelInactive`
while the response is not yet flushed (because parent channel was
non-writable), then then we are getting into troubles, because we are
then trying to cleanup resources, like the buffer of the last server
response, but a bit later, when the last response is now flushed, then
the buffer will be freed again, and we end up with many
IllegalReferenceCountExceptions. So, we think that `channelInactive`
should only be invoked after the last response is fully written and
released.

To reproduce the issue and investigate it with the debugger, you can
first run the
`Http2MultiplexTransportTest.streamHandlerInactivatedResponseFlushed`
from this PR, but without applying the patch. Only pay attention to the
"serverloop" thread.

I tend to think that the problem may start from
`DefaultHttp2RemoteFlowController.FlowState.writeAllocatedBytes(int
allocated)`, line 368, where `frame.writeComplete()` is called: this
method will indirectly trigger
`AbstractHttp2StreamChannel.fireChannelInactiveAndDeregister(), line
742`, but without waiting for the frame promise to complete (the promise
is in `DefaultHttp2ConnectionEncoder.FlowControlledBase`).

Modification:

Added a reproducer test
in`Http2MultiplexTransportTest.streamHandlerInactivatedResponseFlushed`test.
This test simulates non-writable socket by configuring the SO_SNDBUF of
the server side connection to `1`. This will trigger the issue, because
when the server will respond, it will get many writability events and
the response won't be flushed immediately.

I have tried to apply the following simple patch which seems to resolve
the problem: In. `Http2ConnectionHandler.closeStream` method, ensures
that the stream is closed once the future is completed:

instead of doing:

```
    @OverRide
    public void closeStream(final Http2Stream stream, ChannelFuture future) {
        stream.close();

        if (future.isDone()) {
            checkCloseConnection(future);
        } else {
            future.addListener(new ChannelFutureListener() {
                @OverRide
                public void operationComplete(ChannelFuture future) throws Exception {
                    checkCloseConnection(future);
                }
            });
        }
    }
```

the closing is then done in the future listener, like this:
```
Override
    public void closeStream(final Http2Stream stream, ChannelFuture future) {
        if (future.isDone()) {
            doCloseStream(stream, future);
        } else {
            future.addListener(f -> doCloseStream(stream, future));
        }
    }

    private void doCloseStream(final Http2Stream stream, ChannelFuture future) {
        stream.close();
        checkCloseConnection(future);
    }
```

This seems to resolve the issue, because the stream will be closed only
once the last response frame has been fully flushed.
The Http2ConnectionHandlerTest.canCloseStreamWithVoidPromise has also
been modified in order to set the promise to success before doing the
test:

```
    @test
    public void canCloseStreamWithVoidPromise() throws Exception {
        handler = newHandler();
        handler.closeStream(stream, ctx.voidPromise().setSuccess());
        verify(stream, times(1)).close();
        verifyNoMoreInteractions(stream);
    }
```

Result:

Fixes reactor/reactor-netty#2760
@normanmaurer

Copy link
Copy Markdown
Member

done...

@idelpivnitskiy

Copy link
Copy Markdown
Member

+1, the change makes sense, great find @pderop!

@pderop pderop deleted the 4.1.x-h2stream-inactivate-fired-while-write-still-in-progress branch June 1, 2023 13:30
dongjoon-hyun pushed a commit to apache/spark that referenced this pull request Jun 28, 2023
### What changes were proposed in this pull request?
This pr aims to upgrade netty from 4.1.92 to 4.1.93.

### Why are the changes needed?
1.v4.1.92 VS v4.1.93
netty/netty@netty-4.1.92.Final...netty-4.1.93.Final

2.The new version brings some bug fix, eg:
- Reset byte buffer in loop for AbstractDiskHttpData.setContent ([#13320](netty/netty#13320))
- OpenSSL MAX_CERTIFICATE_LIST_BYTES option supported ([#13365](netty/netty#13365))
- Adapt to DirectByteBuffer constructor in Java 21 ([#13366](netty/netty#13366))
- HTTP/2 encoder: allow HEADER_TABLE_SIZE greater than Integer.MAX_VALUE ([#13368](netty/netty#13368))
- Upgrade to latest netty-tcnative to fix memory leak ([#13375](netty/netty#13375))
- H2/H2C server stream channels deactivated while write still in progress ([#13388](netty/netty#13388))
- Channel#bytesBefore(un)writable off by 1 ([#13389](netty/netty#13389))
- HTTP/2 should forward shutdown user events to active streams ([#13394](netty/netty#13394))
- Respect the number of bytes read per datagram when using recvmmsg ([#13399](netty/netty#13399))

3.The release notes as follows:
- https://netty.io/news/2023/05/25/4-1-93-Final.html

4.Why not upgrade to `4-1-94-Final` version?
Because the return value of the 'threadCache()' (from `PoolThreadCache` to `PoolArenasCache`) method of the netty Inner class used in the 'arrow memory netty' version '12.0.1' has changed and belongs to break change, let's wait for the upgrade of the 'arrow memory netty' before upgrading to the '4-1-94-Final' version.

The reference is as follows:
https://github.com/apache/arrow/blob/6af660f48472b8b45a5e01b7136b9b040b185eb1/java/memory/memory-netty/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java#L164
https://github.com/netty/netty/blob/da1a448d5bc4f36cc1744db93fcaf64e198db2bd/buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java#L732-L736

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GA.

Closes #41681 from panbingkun/upgrade_netty.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

IllegalReferenceCountException when using sendObject

3 participants