Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Socket: Linux: unblock synchronous socket operations on Dispose#37486

Merged
davidsh merged 47 commits intodotnet:masterfrom
tmds:linux_unblock_socket
Jun 20, 2019
Merged

Socket: Linux: unblock synchronous socket operations on Dispose#37486
davidsh merged 47 commits intodotnet:masterfrom
tmds:linux_unblock_socket

Conversation

@tmds
Copy link
Member

@tmds tmds commented May 7, 2019

This makes Socket.Dispose on Linux more similar as Dispose on Windows: on-going synchronous operations are stopped. The implementation works on Linux, but isn't guaranteed to work on other Unixes.

based on #32087

Fixes #22564
Fixes #26034

CC @wfurt @stephentoub @geoffkizer @davidsh

@stephentoub
Copy link
Member

Previously when @geoffkizer and I had talked about solving this general issue, we'd outlined a general plan as follows.

  1. Use the existing SafeHandle infrastructure to reference count the native socket handle. Every time we use the socket for a native call, we add a ref count (either relying on standard SafeHandle marshaling, or using DangerousAddRef manually), and when it returns, we release the count (again, either relying on standard marshaling, or using DangerousRelease). This guarantees the handle will never be used after we close the socket, because DangerousAddRef will throw if the SafeHandle has been disposed.

  2. When Socket.Close is called, we should:
    a) Call Shutdown(Both), which will cause any subsequent socket operations to fail.
    b) On Windows call CancelIoEx, which will abort any in progress operations.
    c) Dispose the SafeHandle.

  3. Map cases where an operation failed due to the above logic to ConnectionAborted for consistency with the current behavior. To achieve that, we might need to either set a flag on the Socket, or in (2) above, prior to calling shutdown, null out the SafeHandle from the Socket so that subsequent attempts at starting operations won't be able to find it and so that we can assume failures when it's been nulled out are due to connection aborted.

The goal here is to ensure that all pending and all future operations are canceled, and doing things in the "right" order such that we avoid race conditions where as we're executing this logic additional requests are being made.

It looks like this PR is doing some of this, along with some specialized tweaks for Linux for 2a. Would it make sense to try to do more of this in this PR, or in soon-to-follow PRs?

With all of that in place, I then want to see us do two more things:

  • Make all Sockets P/Invokes use SafeHandles rather than IntPtrs. The only reason some of the synchronous calls currently use the less reliable DangerousGetHandle IntPtr approach is to allow Close to unblock them, but these changes obviate that erroneous goal.
  • Remove the then unnecessary double-layer of SafeHandles wrapping SafeHandles.

@tmds
Copy link
Member Author

tmds commented May 7, 2019

It looks like this PR is doing some of this, along with some specialized tweaks for Linux for 2a. Would it make sense to try to do more of this in this PR, or in soon-to-follow PRs?

@stephentoub Thanks for the info, it makes sense. I'm focusing on figuring out what 2a should be for Linux and get some confidence it works well. We can then do the other parts in this or subsequent PRs.

@tmds
Copy link
Member Author

tmds commented May 8, 2019

From CI results, on Windows: Dispose/Close with on-going operations causes an abortive close ('Connection reset by peer'). I was assuming this would be a FIN close, which is why the tests are failing on Windows.

The abortive close can be generated on Linux by doing the connect to AF_UNSPEC. This should be working, but I've just upgraded my machine to Fedora 30 and it seems it no longer works... This shows the difference in behavior: https://gist.github.com/tmds/1af5f4dc1658d7fe4510216cb244ecd5. I'm not sure when/where/why this changed but I'm going to try to find out. It may be a regression, or it may be intentional.

@tmds
Copy link
Member Author

tmds commented May 8, 2019

but I've just upgraded my machine to Fedora 30 and it seems it no longer works... This shows the difference in behavior: https://gist.github.com/tmds/1af5f4dc1658d7fe4510216cb244ecd5. I'm not sure when/where/why this changed but I'm going to try to find out. It may be a regression, or it may be intentional.

This is a regression and it will be fixed by this patch: https://lists.openwall.net/netdev/2019/05/08/102

@tmds
Copy link
Member Author

tmds commented May 9, 2019

On Linux this is working well.
On OSX, these fail to get cancelled: Accept, Connect, UdpReceive, SyncTcpReceiveSend.
The last one can be made to work by doing the Shutdown, but then the peer sees a FIN abort instead of a RST abort.

@davidsh
Copy link
Contributor

davidsh commented Jun 12, 2019

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 4 pipeline(s).

@davidsh
Copy link
Contributor

davidsh commented Jun 13, 2019

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 4 pipeline(s).

@davidsh
Copy link
Contributor

davidsh commented Jun 17, 2019

@tmds Can you please fix up the merge conflicts?

@davidsh
Copy link
Contributor

davidsh commented Jun 19, 2019

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 4 pipeline(s).

@tmds
Copy link
Member Author

tmds commented Jun 20, 2019

CI failures:

  • Fail of PostAsyncExpect100Continue_LateForbiddenResponse_Ok on RHEL6:
System.IO.IOException : Unable to read data from the transport connection: Broken pipe.\n---- System.Net.Sockets.SocketException : Broken pipe
   at System.Net.Test.Common.Http2LoopbackServer.WriteFrameAsync(Frame frame) in /_/src/Common/tests/System/Net/Http/Http2LoopbackServer.cs:line 83
   at System.Net.Test.Common.Http2LoopbackServer.SendGoAway(Int32 lastStreamId) in /_/src/Common/tests/System/Net/Http/Http2LoopbackServer.cs:line 615
   at System.Net.Http.Functional.Tests.HttpClientHandlerTest_Http2.<>c__DisplayClass50_0.<<PostAsyncExpect100Continue_LateForbiddenResponse_Ok>b__1>d.MoveNext() in /_/src/System.Net.Http/tests/FunctionalTests/HttpClientHandlerTest.Http2.cs:line 1404
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.Tasks.TaskTimeoutExtensions.WhenAllOrAnyFailed(Task[] tasks) in /_/src/Common/tests/System/Threading/Tasks/TaskTimeoutExtensions.cs:line 83
   at System.Threading.Tasks.TaskTimeoutExtensions.WhenAllOrAnyFailed(Task[] tasks) in /_/src/Common/tests/System/Threading/Tasks/TaskTimeoutExtensions.cs:line 111
   at System.Threading.Tasks.TaskTimeoutExtensions.WhenAllOrAnyFailed(Task[] tasks, Int32 millisecondsTimeout) in /_/src/Common/tests/System/Threading/Tasks/TaskTimeoutExtensions.cs:line 71
   at System.Net.Test.Common.Http2LoopbackServer.CreateClientAndServerAsync(Func`2 clientFunc, Func`2 serverFunc, Int32 timeout) in /_/src/Common/tests/System/Net/Http/Http2LoopbackServer.cs:line 728
   at System.Net.Http.Functional.Tests.HttpClientHandlerTest_Http2.PostAsyncExpect100Continue_LateForbiddenResponse_Ok() in /_/src/System.Net.Http/tests/FunctionalTests/HttpClientHandlerTest.Http2.cs:line 1370
--- End of stack trace from previous location where exception was thrown ---
----- Inner Stack Trace -----
  • Fail of System.Diagnostics.Tests.ProcessTests.TestCheckChildProcessUserAndGroupIdsElevated on F29:
      Assert.NotNull() Failure
      Stack Trace:
        /_/src/System.Diagnostics.Process/tests/ProcessTests.Unix.cs(640,0): at System.Diagnostics.Tests.ProcessTests.GetCurrentRealUserName()
        /_/src/System.Diagnostics.Process/tests/ProcessTests.Unix.cs(615,0): at System.Diagnostics.Tests.ProcessTests.TestCheckChildProcessUserAndGroupIdsElevated(Boolean useRootGroups)
  • Linux x64_Debug was canceled.

@davidsh davidsh merged commit 3fac6ad into dotnet:master Jun 20, 2019
@krwq
Copy link
Member

krwq commented Jun 21, 2019

@wtgodbe do you have cycles to take a look at:
System.Diagnostics.Tests.ProcessTests.TestCheckChildProcessUserAndGroupIdsElevated
which is now failing after this? This is very likely a test issue which resurfaced after merging this - similarly as networking issue.

Most likely issue is that one of the streams (or something else using sockets) gets disposed sooner than test is prepared for.

@wtgodbe
Copy link
Member

wtgodbe commented Jun 21, 2019

I can probably take a look either today or Monday. Assert failure is at

@krwq
Copy link
Member

krwq commented Jun 21, 2019

thanks! possible it's unrelated 😄

@davidsh
Copy link
Contributor

davidsh commented Jun 21, 2019

@wfurt has done an analysis and it seems that this PR has unintentionally affected async Socket I/O. There are cases where it causes RST instead of FIN to be sent. A full write-up will be forthcoming.

For now, we are going to revert this PR for now.

@davidsh
Copy link
Contributor

davidsh commented Jun 21, 2019

We reverted this PR.

See #38765 for analysis.

See issue #38750 for additional feedback when redoing this PR.

steveharter pushed a commit to steveharter/dotnet_corefx that referenced this pull request Jun 25, 2019
…et#37486)

* Socket: Linux: unblock synchronous socket operations on Dispose

* Fix windows build

* TryUnblockSocket: always do an abortive close for TCP

* TryUnblockSocket: experiment, always call Shutdown

* SyncTcpReceiveSendGetsCancelledByDisposeOrClose: Don't check for abortive close

* Fix build

* Revert always shutdown to compare CI results

* PR feedback

* TcpReceiveSendGetsCanceledByDisposeOrClose: fix timing race

* TcpReceiveSendGetsCanceledByDisposeOrClose: fix timing race II

* Fix test

* TcpReceiveSendGetsCanceledByDisposeOrClose: fix timing race III

* TcpReceiveSendGetsCanceledByDisposeOrClose: only run on Sync

* Also perform abortive close when canceling async send/receive

* Fix aborted set to true

* Check what fails on OSX with Shutdown

* Revert "Check what fails on OSX with Shutdown"

This reverts commit 5ab7fce.

* Don't perform abortive close for operations that were already canceled

* interrup  Accept/Listen on OSX

* Find out what Disconnectx does for different tests on CI machines

* Move Linux connect AF_UNSPEC and OSX disconnectx into pal Disconnect function, update tests for OSX

* Trigger CI

* TcpReceiveSendGetsCanceledByDispose: check SocketErrorCode

* TcpReceiveSendGetsCanceledByDispose: OSX is also performing an RST close for disconnectx

* Revert "TcpReceiveSendGetsCanceledByDispose: OSX is also performing an RST close for disconnectx"

This reverts commit a8d5b02.

* TcpReceiveSendGetsCanceledByDispose: fix test for OSX

* call CancelIoEx to kick out pending operations

* Socket interop: replace IntPtr with SafeHandle

* Windows: RST close when operations were canceled

* Extend tests to check local SocketError

* test: update expected SocketErrors

* update tests: Apm API throws ObjectDisposedException instead of SocketException

* Extend asserts

* Update expected result for ConnectTask

* Update tests based on Windows behavior (dotnet#37706)

* Map SocketErrors in Socket.cs

* Rewrite nullable asserts

* Fix test for Mac

* Handle SendFile SocketError

* SyncSendFileGetsCanceledByDispose: change how we generate the large file

* SyncSendFileGetsCanceledByDispose: ensure unique filename

* Fix NETFX build

* SyncSendFileGetsCanceledByDispose: skip on NetFramework

* PR feedback

* Fix broken build

* Perform DangerousAddRef for Socket used with UpdateAcceptContext socket option
steveharter pushed a commit to steveharter/dotnet_corefx that referenced this pull request Jun 25, 2019
@karelz karelz added this to the 3.0 milestone Jul 16, 2019
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
…et/corefx#37486)

* Socket: Linux: unblock synchronous socket operations on Dispose

* Fix windows build

* TryUnblockSocket: always do an abortive close for TCP

* TryUnblockSocket: experiment, always call Shutdown

* SyncTcpReceiveSendGetsCancelledByDisposeOrClose: Don't check for abortive close

* Fix build

* Revert always shutdown to compare CI results

* PR feedback

* TcpReceiveSendGetsCanceledByDisposeOrClose: fix timing race

* TcpReceiveSendGetsCanceledByDisposeOrClose: fix timing race II

* Fix test

* TcpReceiveSendGetsCanceledByDisposeOrClose: fix timing race III

* TcpReceiveSendGetsCanceledByDisposeOrClose: only run on Sync

* Also perform abortive close when canceling async send/receive

* Fix aborted set to true

* Check what fails on OSX with Shutdown

* Revert "Check what fails on OSX with Shutdown"

This reverts commit dotnet/corefx@5ab7fce.

* Don't perform abortive close for operations that were already canceled

* interrup  Accept/Listen on OSX

* Find out what Disconnectx does for different tests on CI machines

* Move Linux connect AF_UNSPEC and OSX disconnectx into pal Disconnect function, update tests for OSX

* Trigger CI

* TcpReceiveSendGetsCanceledByDispose: check SocketErrorCode

* TcpReceiveSendGetsCanceledByDispose: OSX is also performing an RST close for disconnectx

* Revert "TcpReceiveSendGetsCanceledByDispose: OSX is also performing an RST close for disconnectx"

This reverts commit dotnet/corefx@a8d5b02.

* TcpReceiveSendGetsCanceledByDispose: fix test for OSX

* call CancelIoEx to kick out pending operations

* Socket interop: replace IntPtr with SafeHandle

* Windows: RST close when operations were canceled

* Extend tests to check local SocketError

* test: update expected SocketErrors

* update tests: Apm API throws ObjectDisposedException instead of SocketException

* Extend asserts

* Update expected result for ConnectTask

* Update tests based on Windows behavior (dotnet/corefx#37706)

* Map SocketErrors in Socket.cs

* Rewrite nullable asserts

* Fix test for Mac

* Handle SendFile SocketError

* SyncSendFileGetsCanceledByDispose: change how we generate the large file

* SyncSendFileGetsCanceledByDispose: ensure unique filename

* Fix NETFX build

* SyncSendFileGetsCanceledByDispose: skip on NetFramework

* PR feedback

* Fix broken build

* Perform DangerousAddRef for Socket used with UpdateAcceptContext socket option


Commit migrated from dotnet/corefx@3fac6ad
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-System.Net.Sockets os-linux Linux OS (any supported distro)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TcpListener.Stop hangs if Accept is in progress Linux Sockets: Close with pending synchronous Receive hangs

8 participants