async_tcp_client: remove callbacks if connection was not closed #35410
ggreenway merged 8 commits into envoyproxy:main
Conversation
Signed-off-by: Ohad Vano <ohadvano@gmail.com>
/assign @botengyao
botengyao
left a comment
Overall LGTM, and thanks for the fix!
Do you mind also adding an integration test to tcp_async_client_integration_test.cc?
I don't think this is testable in integration tests, but if you find a way to, please lmk.
The existing tcp async client integration test uses a fake filter, and the lifetime of the async client in that test filter matches the downstream connection, which has the longest lifetime. To simulate this scenario, we could create a customized user that owns the async client and then destroy that object first. I also think this can be simulated in unit tests, where we control the lifetime.
Are you referring to the close() and cleanup() sections? If so, that is not enough. It would just do a regular close of the upstream connection, and there won't be a crash.
The changes I made to the unit tests will test this. Without the fix, many of the unit tests in the file will fail (those that don't explicitly call close()).
Are you referring to the close() and cleanup() sections
To really simulate the issue, the AsyncTcpClient itself needs to be deleted before FakeUpstream is being destroyed, while still having an active upstream connection
Nope; like you described, we can establish an object that uses this async client and manually destroy that object first.
LGTM, and deferred to @ggreenway for another pass.
/assign @ggreenway
Signed-off-by: Ohad Vano <ohadvano@gmail.com>
```cpp
connection_->removeConnectionCallbacks(*this);
}

close(Network::ConnectionCloseType::NoFlush);
```
Could you elaborate on the reason the close() is added here? I don't see it in your initial fix, and a user could also close the connection with another close type.
The reason is that if we don't close it, tests fail because of the ASSERT in:
envoy/source/common/network/connection_impl.cc
Lines 116 to 125 in b1e3351
The comment in that code section explains what is expected of the owner of the ConnectionImpl:
In general we assume that owning code has called close() previously to the destructor being run.
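To illustrate why the destructor must both deregister callbacks and close a still-open connection, here is a minimal, self-contained sketch of the destruction-order hazard and the fix. The class and member names (`Connection`, `AsyncClient`, `events_seen_`) are illustrative, not Envoy's actual implementation:

```cpp
#include <cassert>
#include <memory>

// Callback interface the connection fires events into, mirroring
// Network::ConnectionCallbacks at a very small scale.
struct ConnectionCallbacks {
  virtual ~ConnectionCallbacks() = default;
  virtual void onEvent() = 0;
};

struct Connection {
  ConnectionCallbacks* cbs_ = nullptr;
  bool closed_ = false;

  void addCallbacks(ConnectionCallbacks& c) { cbs_ = &c; }
  void removeCallbacks() { cbs_ = nullptr; }

  void close() {
    if (closed_) return;
    closed_ = true;
    if (cbs_ != nullptr) {
      cbs_->onEvent(); // mirrors raiseEvent() firing during close
    }
  }

  // Destroying an un-closed connection still closes it, and therefore
  // still raises an event into whatever callbacks are registered.
  ~Connection() { close(); }
};

struct AsyncClient : public ConnectionCallbacks {
  std::unique_ptr<Connection> connection_;
  int events_seen_ = 0;

  AsyncClient() : connection_(std::make_unique<Connection>()) {
    connection_->addCallbacks(*this);
  }

  void onEvent() override { ++events_seen_; }

  // The fix: deregister callbacks and close before ~Connection runs,
  // so no event can reach this object mid-destruction.
  ~AsyncClient() override {
    if (connection_) {
      connection_->removeCallbacks();
      connection_->close();
    }
  }
};
```

Without the removeCallbacks()/close() pair in ~AsyncClient, ~Connection would fire onEvent() into a partially destroyed object, which is the use-after-destruction behind the reported segfault.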
I feel this is mitigating the wrong usage of this client in the test filter to simulate the callback issue, and ideally this should not happen; it needs to be fixed on the usage side. FWIW, the original version of the fix LGTM, and we can change how we test it or remove the newly added tests; either works for me.
The changes differ only in the close call; is there a reason why we shouldn't include it, regardless of the testing?
Removing the newly added test would mean no integration test coverage for this case.
I think the reason you see the non-close error is that we force-release the async client before calling any close (the wrong usage), which is exactly what that assertion checks. If we really want to keep the close here to cover this should-not-happen case, adding a boolean guard for the close may be less risky. I am more worried that if a caller calls close(FlushWriteAndDelay) and a close(NoFlush) then follows while the connection is still in the deferred deletion list, the previous close will be overridden.
/wait
/retest
botengyao
left a comment
Thanks, this looks neat; overall LGTM modulo one concern about the recent change.
@botengyao thanks for the review. I think what you're saying makes sense. My suggestion is to track whether the user already called close(), so the condition would be to only force a close from the destructor when the connection is still open and no close was previously requested.
@ohadvano, sgtm, a close boolean works, thanks!
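The agreed-upon "close boolean" can be sketched as follows. `ClientSketch`, `close_requested_`, and `last_close_type_` are illustrative names, not Envoy's actual members; the point is that the destructor's forced NoFlush close runs only when the caller never issued a close of its own, so it cannot override an earlier close(FlushWriteAndDelay):

```cpp
#include <cassert>

// Stand-in for Network::ConnectionCloseType.
enum class CloseType { NoFlush, FlushWrite, FlushWriteAndDelay };

class ClientSketch {
public:
  void close(CloseType type) {
    // Record that the caller (or destructor) requested a close, and with
    // which type. A real implementation would close the connection here.
    close_requested_ = true;
    last_close_type_ = type;
  }

  ~ClientSketch() {
    // Only the should-not-happen path: the owner is destroying the client
    // without ever having closed it.
    if (!close_requested_) {
      close(CloseType::NoFlush);
    }
  }

  bool close_requested_ = false;
  CloseType last_close_type_ = CloseType::NoFlush;
};
```

With this guard, a caller's close(FlushWriteAndDelay) is left untouched even if the connection is still in the deferred deletion list when the client is destroyed.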
Signed-off-by: Ohad Vano <ohadvano@gmail.com>
ping @ggreenway @botengyao
…yproxy#35410)

When the ``AsyncTcpClient`` is being destroyed while it still has an active client connection, there is a crash: during the instance destruction, the ``ClientConnection`` object is also destroyed, causing ``raiseEvent`` to be called back into ``AsyncTcpClient`` while it is being destroyed.

Caught with the following stack trace:

```
Caught Segmentation fault, suspect faulting address 0x0
Backtrace (use tools/stack_decode.py to get line numbers):
Envoy version: ee8c765a07037033766ea556c032120b497152b3/1.27.0/Clean/RELEASE/BoringSSL
#0: __restore_rt [0x7d80ab903420]
#1: Envoy::Extensions::AccessLoggers::Fluentd::FluentdAccessLoggerImpl::onEvent() [0x58313528746b]
#2: Envoy::Tcp::AsyncTcpClientImpl::onEvent() [0x5831359da00a]
#3: Envoy::Network::ConnectionImplBase::raiseConnectionEvent() [0x583135f0521d]
#4: Envoy::Network::ConnectionImpl::raiseEvent() [0x583135e9fed9]
#5: Envoy::Network::ConnectionImpl::closeSocket() [0x583135e9f90c]
#6: Envoy::Network::ConnectionImpl::close() [0x583135e9e54c]
#7: Envoy::Network::ConnectionImpl::~ConnectionImpl() [0x583135e9de5c]
#8: Envoy::Network::ClientConnectionImpl::~ClientConnectionImpl() [0x5831355fd25e]
#9: Envoy::Tcp::AsyncTcpClientImpl::~AsyncTcpClientImpl() [0x5831359da247]
#10: Envoy::Extensions::AccessLoggers::Fluentd::FluentdAccessLoggerImpl::~FluentdAccessLoggerImpl() [0x583135289350]
#11: Envoy::Extensions::AccessLoggers::Fluentd::FluentdAccessLog::ThreadLocalLogger::~ThreadLocalLogger() [0x58313528adbf]
#12: Envoy::ThreadLocal::InstanceImpl::shutdownThread() [0x58313560373a]
#13: Envoy::Server::WorkerImpl::threadRoutine() [0x583135630c0a]
#14: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::__invoke() [0x5831364e88d5]
#15: start_thread [0x7d80ab8f7609]
```

Risk Level: low
Testing: unit tests
Docs Changes: none
Release Notes: none
Platform Specific Features: none

---------

Signed-off-by: Ohad Vano <ohadvano@gmail.com>
Signed-off-by: Martin Duke <martin.h.duke@gmail.com>
Signed-off-by: asingh-g <abhisinghx@google.com>
Commit Message: when the ``AsyncTcpClient`` is being destroyed but it also has an active client connection, there's a crash, since during the instance destruction the ``ClientConnection`` object would also be destroyed, causing ``raiseEvent`` to be called back to ``AsyncTcpClient`` while it is being destroyed.
Additional Description: Caught with the following stack trace:
Risk Level: low
Testing: unit tests
Docs Changes: none
Release Notes: none
Platform Specific Features: none