Shutdown socket in some cases to prevent endless hang on blocking operation by wfurt · Pull Request #26898 · dotnet/corefx

wfurt · 2018-02-06T18:10:00Z

Fixes #22564
Fixes #26034

Use Shutdown() when closing blocking sockets so that a pending blocking operation does not hang forever.

wfurt · 2018-02-06T18:12:04Z

@dotnet-bot test Outerloop Linux x64 Debug Build
@dotnet-bot test Outerloop Windows x64 Debug Build
@dotnet-bot test Outerloop OSX x64 Debug Build
@dotnet-bot Test Outerloop UWP CoreCLR x64 Debug Build
@dotnet-bot Test Outerloop NETFX x86 Release Build

ianhays · 2018-02-06T18:56:34Z

@dotnet-bot test innerloop Fedora24 debug

wfurt · 2018-02-06T22:30:36Z

@dotnet-bot test Outerloop Linux x64 Debug Build
@dotnet-bot test Outerloop Windows x64 Debug Build
@dotnet-bot test Outerloop OSX x64 Debug Build
@dotnet-bot Test Outerloop UWP CoreCLR x64 Debug Build
@dotnet-bot Test Outerloop NETFX x86 Release Build
@dotnet-bot test innerloop Fedora24 debug

davidsh · 2018-02-06T22:43:48Z

src/Common/src/System/Net/SafeCloseSocket.Unix.cs

            }
        }

+        private void TryWakeup(InnerSafeCloseSocket innerSocket)


nit: I would rename this to remove the 'Try' prefix since methods with that prefix have an expected pattern that doesn't match this. Comment applies to other files as well.
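
For context, the 'Try' prefix in .NET usually signals a method that returns `bool` and hands its result back through an `out` parameter instead of throwing, as `int.TryParse` does. A minimal sketch of that shape follows; `PortParser` and `TryParsePort` are hypothetical names used only for illustration and are not part of this PR.

```csharp
using System;

public static class PortParser
{
    // Follows the Try pattern: returns false instead of throwing,
    // and hands the parsed value back through an out parameter.
    public static bool TryParsePort(string text, out int port)
    {
        if (int.TryParse(text, out int value) && value >= 0 && value <= 65535)
        {
            port = value;
            return true;
        }

        // On failure the out parameter is set to a default value.
        port = 0;
        return false;
    }
}
```

A `void` helper like the one in this diff reports neither success nor a result, which is why the prefix reads as misleading.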

davidsh · 2018-02-06T22:44:29Z

src/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs

            }
        }

+        public bool GetNonBlocking() {


nit: bracing style doesn't match other methods in file. should use Allman-style bracing.

stephentoub · 2018-02-07T16:52:24Z

src/Common/src/System/Net/SafeCloseSocket.Unix.cs


+        private void TryWakeup(InnerSafeCloseSocket innerSocket)
+        {
+            if ((AsyncContext == null || !AsyncContext.GetNonBlocking()) && innerSocket != null && !_underlyingHandleNonBlocking && !innerSocket.IsClosed && !innerSocket.IsInvalid)


This is a fairly complicated condition and it's not clear to me exactly what it's validating. Can you add a comment?

Also, are both the IsClosed and IsInvalid checks necessary?

I'll check if I can simplify this. What I want is to find out whether we have a valid OS-level blocking socket. In all other cases I want to skip the shutdown and keep the existing behavior. There are existing tests verifying that we throw if Dispose is called in the middle of a pending async operation. I did not want to change that behavior.
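
One way to make a condition like this easier to review is to give each clause a name. The sketch below restates the condition from the diff over plain booleans; `UnblockPolicy`, `ShouldShutdownBeforeClose`, and the parameter names are hypothetical, and the comments describe only what each clause literally checks, not the author's final wording.

```csharp
using System;

public static class UnblockPolicy
{
    // Hypothetical helper: mirrors the condition under review using plain
    // booleans so that each clause can carry a name and a comment.
    public static bool ShouldShutdownBeforeClose(
        bool hasAsyncContext,
        bool asyncContextNonBlocking,
        bool underlyingHandleNonBlocking,
        bool hasInnerSocket,
        bool innerClosed,
        bool innerInvalid)
    {
        // Either there is no async context, or it does not report non-blocking mode.
        bool managedBlocking = !hasAsyncContext || !asyncContextNonBlocking;

        // The underlying OS handle has not been switched to non-blocking mode.
        bool osHandleBlocking = !underlyingHandleNonBlocking;

        // There is a handle that is neither closed nor invalid to call shutdown() on.
        bool handleUsable = hasInnerSocket && !innerClosed && !innerInvalid;

        return managedBlocking && osHandleBlocking && handleUsable;
    }
}
```

Written this way, the question about whether both the IsClosed and IsInvalid checks are needed becomes a question about a single named clause.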

stephentoub · 2018-02-07T16:52:53Z

src/Common/src/System/Net/SafeCloseSocket.Unix.cs

+        {
+            if ((AsyncContext == null || !AsyncContext.GetNonBlocking()) && innerSocket != null && !_underlyingHandleNonBlocking && !innerSocket.IsClosed && !innerSocket.IsInvalid)
+            {
+                Interop.Sys.Shutdown(innerSocket, SocketShutdown.Receive);


Just Receive? Will this fix blocked sends/connects, or do those require shutting down the Send direction as well?

I was thinking about it, @stephentoub, but I did not want to interfere with the write logic. At least on Unix, close() would always finish writing data. Also, the write would finish or fail, even if that may take some time. I'm not sure if there is a case where it would hang forever. I can revisit that if you think we should.
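
The receive-only shutdown discussed here can be tried out in isolation. The following is a minimal sketch, assuming a loopback TCP connection; `ShutdownWakeupDemo` is a hypothetical name, and it uses the public `Socket.Shutdown` API rather than the internal `Interop.Sys.Shutdown` call from the diff. Whether the blocked `Receive` actually returns after the shutdown is platform dependent, so the sketch reports that outcome instead of assuming it.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class ShutdownWakeupDemo
{
    // Returns whether Receive was still blocked before the shutdown and
    // whether it returned within a few seconds after it.
    public static (bool BlockedBefore, bool ReturnedAfter) Run()
    {
        using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
        listener.Listen(1);

        using var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        client.Connect(listener.LocalEndPoint!);
        using Socket server = listener.Accept();

        // The peer never sends, so this Receive has nothing to return.
        Task<int> receive = Task.Run(() =>
        {
            try { return client.Receive(new byte[1]); }
            catch (SocketException) { return -1; }
            catch (ObjectDisposedException) { return -1; }
        });

        bool blockedBefore = !receive.Wait(TimeSpan.FromMilliseconds(500));

        // Shut down only the receive direction, as the diff does.
        client.Shutdown(SocketShutdown.Receive);

        bool returnedAfter = receive.Wait(TimeSpan.FromSeconds(5));
        return (blockedBefore, returnedAfter);
    }
}
```

The same harness could be pointed at a blocked `Send` or `Connect` to check whether the Send direction needs shutting down as well, which is the open question in this comment.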

stephentoub · 2018-02-07T16:53:30Z

src/Common/src/System/Net/SafeCloseSocket.Windows.cs


+        private void TryWakeup(InnerSafeCloseSocket innerSocket)
+        {
+            return;


Nit: this isn't necessary; you can just remove the "return;" and leave the method empty. You might also add a comment indicating that it's empty on purpose, and explaining why

I may need to revisit this. Comments from the bugs suggest that Windows may have problems in a single-core configuration. I plan to test it, but I was using this PR to get wide test coverage from the CI run.

stephentoub · 2018-02-07T16:54:39Z

src/Common/src/System/Net/SafeCloseSocket.cs

                Dispose();
                if (innerSocket != null)
                {
+                    TryWakeup(innerSocket);


What is the "Wait until it's safe" below referring to? I'm wondering if it's ok to do this here (or even required to do this here), or if it should move down to before the InnerReleaseHandle.

stephentoub · 2018-02-07T16:55:05Z

src/System.Net.Sockets/tests/FunctionalTests/SendReceive.cs

+                    // Blocking read.
+                    try
+                    {
+                        client.Receive(buffer);


It'd be good to have tests for the other main blocking operations, e.g. Send, Connect, Accept, etc.

Yes, I was planning to. There is a separate bug for accept. The problem is that the shutdown does not work for that on OSX (it works OK on Linux). I'm still investigating.

stephentoub · 2018-02-07T16:55:25Z

src/System.Net.Sockets/tests/FunctionalTests/SendReceive.cs

+                    catch (ObjectDisposedException)
+                    {
+                        // Closed and disposed by parent thread.
+                        return;


We don't still want to call client.Close in this case?

stephentoub · 2018-02-07T16:55:54Z

src/System.Net.Sockets/tests/FunctionalTests/SendReceive.cs

+                clientThread.Start();
+                Socket s = server.Accept();
+                // Close client socket from parent thread.
+                client.Close();


Why is the client.Close duplicated here as well?

This is a way to force a close from another thread. The clientThread should be blocked in receive(). This mimics the example from #26034.

wfurt · 2018-02-08T09:24:31Z

I updated the unit tests to cover both known issues. Note that the shutdown() trick does not work on OSX for accept(); it fails with ENOTCONN. It works just fine on Linux.
I modified the tests to run in a separate process so that the whole suite does not hang if anything goes wrong. I verified that the test fails with some details when the actual test function blocks.

Also note that I did not find a reliable way to detect that the child thread is blocked. The documentation says Running = "The thread has been started, it is not blocked", but that does not seem to cover cases where we are blocked on IO. I added a Send/Receive sequence, and that seems to be reasonably reliable in my tests (Linux and OSX). If the timing does not work out, the thread will get an exception saying that the object was disposed. That could possibly return TestSkip one day.

wfurt · 2018-02-08T09:24:58Z

@dotnet-bot test Outerloop Linux x64 Debug Build
@dotnet-bot test Outerloop Windows x64 Debug Build
@dotnet-bot test Outerloop OSX x64 Debug Build
@dotnet-bot Test Outerloop UWP CoreCLR x64 Debug Build
@dotnet-bot Test Outerloop NETFX x86 Release Build

wfurt · 2018-02-26T22:30:15Z

@dotnet-bot test Outerloop Linux x64 Debug Build
@dotnet-bot test Outerloop Windows x64 Debug Build
@dotnet-bot test Outerloop OSX x64 Debug Build
@dotnet-bot Test Outerloop UWP x64 Debug Build
@dotnet-bot Test Outerloop NETFX x86 Release Build

tmds · 2018-03-01T10:01:15Z

@wfurt on Linux, you can use connect to AF_UNSPEC to disconnect a TCP socket. This will make blocking receive and send calls return. Also, epoll returns an event for the fd. This may work similarly on OSX/BSD. I think it's worth exploring as an alternative to the Shutdown call. Can you give that a try?

tmds · 2018-03-01T15:25:01Z

on Linux, you can use connect to AF_UNSPEC to disconnect a TCP socket.

I compared this with shutdown. The connect to AF_UNSPEC causes a connection reset to be sent to the peer.
So shutdown is better.

wfurt · 2018-03-01T17:23:45Z

Thanks for the note, @tmds. The shutdown should also flush any pending data from the socket buffers. I'll experiment with AF_UNSPEC on OSX. I still have not figured out how to wake up a blocking accept there.

karelz · 2018-03-12T15:51:58Z

@wfurt what is left to do here? Any ETA?

wfurt · 2018-03-14T04:38:41Z

I'm not sure, @karelz. Perhaps @stephentoub and @geoffkizer can take another look.
To me the biggest mystery is the question @stephentoub asked: "What is the "Wait until it's safe" below referring to". That is existing code and I was not comfortable changing it for this issue.
I'd be happy to open a new issue as a reminder to do it in the future.

karelz · 2018-03-14T16:13:55Z

Why are the tests failing? Once the test runs are clean, please work with @stephentoub to clarify his comment & to get final code review.

davidsh · 2018-03-14T16:50:34Z

src/Common/src/System/Net/SafeCloseSocket.Windows.cs

            }
        }

+        private void UnblockSocket(InnerSafeCloseSocket innerSocket) {}


Please add comments to this to explain why this method is a no-op on Windows.

Would you see any harm in calling shutdown() on Windows even if it is not needed, @davidsh, just for consistency? For now, I was trying to keep the change as minimal as possible.

Would you see harm of calling shutdown() on Windows even if it is not needed

We already have an explicit API that developers can call to do Shutdown(). I don't think it is a good idea to do this automatically.

I also don't understand why these changes are needed in this PR. It seems like it isn't clear as to the root cause.

The difference is that on Unix, calling read() or accept() blocks the calling thread until the operation fails or succeeds, possibly indefinitely. Unlike async or non-blocking operations, there is no timeout associated with that call. That part seems to work differently on Windows, where we are never really blocked: e.g. when cancellation is requested, we can stop the pending operation from our socket code.

So, if this is a platform difference in behavior (Windows vs. Linux), then you shouldn't add any code for Windows. But you should still add comments to this no-op method to explain. And the words above are probably what you need in your comment for source code.

wfurt · 2018-03-14T18:24:40Z

I don't know why tests are failing @karelz. I get 404 for all the links. I can try to kick the tests again.

wfurt · 2018-03-14T18:24:57Z

@dotnet-bot test Outerloop Linux x64 Debug Build
@dotnet-bot test Outerloop Windows x64 Debug Build
@dotnet-bot test Outerloop OSX x64 Debug Build

wfurt · 2018-03-14T22:30:26Z

Most tests failed on authentication failures or for unknown reasons. Valid runs have failures unrelated to this PR (like crypto).

wfurt · 2018-03-20T20:34:25Z

@dotnet-bot test Outerloop Linux x64 Debug Build
@dotnet-bot test Outerloop Windows x64 Debug Build
@dotnet-bot test Outerloop OSX x64 Debug Build

wfurt · 2018-03-25T03:54:16Z

@dotnet-bot test Linux x64 Release Build

stephentoub · 2018-05-24T14:30:36Z

@wfurt, are we still pursuing this?

wfurt · 2018-05-24T15:28:01Z

Yes, I would like to. I need to investigate the CI failures.

wfurt · 2018-05-25T19:33:51Z

@dotnet-bot test Outerloop Linux x64 Debug Build
@dotnet-bot test Outerloop Windows x64 Debug Build
@dotnet-bot test Outerloop OSX x64 Debug Build

wfurt · 2018-05-25T19:34:02Z

rebased with current master

karelz · 2018-07-19T12:41:51Z

@wfurt this is an ancient PR. If it is not ready for merge, let's close it and reopen once you have time to finish it please.

wfurt · 2018-07-25T18:21:04Z

@dotnet-bot test Outerloop Linux x64 Debug Build
@dotnet-bot test Outerloop Windows x64 Debug Build
@dotnet-bot test Outerloop OSX x64 Debug Build
@dotnet-bot Test Outerloop UWP CoreCLR x64 Debug Build
@dotnet-bot Test Outerloop NETFX x86 Release Build

wfurt · 2018-07-25T18:22:08Z

I updated the tests and did several hundred runs with the full parallel suite.

danmoseley · 2018-07-28T05:32:44Z

@wfurt looks like the only failures passed on rerun.

krwq · 2018-08-04T04:32:30Z

src/System.Net.Sockets/tests/FunctionalTests/SendReceive.cs

+        {
+            // This test verifies blocking behavior. Always run it as remote task
+            // to isolate any possible failures.
+            RemoteInvoke((address) =>


why isn't thread enough isolation?

Guess: failure could manifest as exception on threadpool thread?

Isn't Task.Run going to take care of all of the exceptions softly? Could you give example (or simulation) of failure which we are afraid of?

I wanted to be sure that if something goes wrong we do not block the whole xunit suite, @krwq.
A remote process seems more resilient (even if maybe harder to debug).

I agree with @krwq here. If the test is failing, it doesn't really matter whether it is blocking the rest of xunit or not. Adding RemoteInvoke just seems to make the test more complicated.

wfurt · 2018-08-06T22:12:21Z

@dotnet-bot test Outerloop OSX x64 Debug Build

wfurt · 2018-08-08T17:52:12Z

All the OSX failures are in tests going to an external echo server and seem unrelated to this PR.

karelz · 2018-08-15T15:17:29Z

@wfurt please merge it today or close it.

geoffkizer · 2018-08-16T19:48:02Z

src/Common/src/System/Net/SafeCloseSocket.Unix.cs


+        private void UnblockSocket(InnerSafeCloseSocket innerSocket)
+        {
+            if ((AsyncContext == null || !AsyncContext.GetNonBlocking()) && innerSocket != null && !_underlyingHandleNonBlocking && !innerSocket.IsClosed && !innerSocket.IsInvalid)


I think we are doing all these tests to try to minimize the impact of this change. I.e. we want to only do the shutdown call in the case where we know it's hanging today. Correct?

I think it's fine to be cautious here, but we should add a comment explaining that these checks are basically just to be cautious (unless that's not true for some of them). If/when we revisit the socket close logic (and I hope we will do this at some point in 3.0), we will want to revisit this, and a comment will help explain the rationale behind this.

Yes. I can add a comment with an explanation. For async/non-blocking sockets we have no problem with being stuck in a system call.

geoffkizer · 2018-08-16T19:55:09Z

src/System.Net.Sockets/tests/FunctionalTests/SendReceive.cs

+                    byte[] buffer = new byte[1];
+                    buffer[0] = Convert.ToByte('a');
+
+                    Thread clientThread = new Thread(() =>


Use Task.Run instead of creating a new thread

Same issue below
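
As an illustration of this suggestion, a blocking call can be run with `Task.Run` and awaited with a bounded wait, so that a hang shows up as a test failure rather than a stalled run. This is only a sketch under that assumption; `BlockingTestHelper` and `RunWithTimeout` are hypothetical names and not helpers that exist in the corefx test tree.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class BlockingTestHelper
{
    // Runs a blocking call on the thread pool and fails loudly if it does not
    // finish in time, instead of letting the test run hang.
    // If the call itself throws, Wait rethrows it wrapped in an AggregateException.
    public static T RunWithTimeout<T>(Func<T> blockingCall, TimeSpan timeout)
    {
        Task<T> task = Task.Run(blockingCall);
        if (!task.Wait(timeout))
        {
            throw new TimeoutException($"Blocking call did not return within {timeout}.");
        }

        return task.Result;
    }
}
```

A test could wrap `client.Receive(buffer)` in such a helper. Note that a call which never returns still leaves a thread-pool thread blocked, which is part of the trade-off against running the test in a separate process.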

karelz · 2018-08-23T20:52:29Z

@wfurt how far are we from merging? As I said earlier, if we can't finish it in a couple of days, please close it for now.

davidsh added the area-System.Net.Sockets label Feb 6, 2018

davidsh changed the title ~~[WIP] try to shutdown socket in some cases to prevent endless hung~~ [WIP] Try to shutdown socket in some cases to prevent endless hang Feb 6, 2018

davidsh reviewed Feb 6, 2018

stephentoub reviewed Feb 7, 2018

karelz assigned wfurt Feb 9, 2018

wfurt changed the title ~~[WIP] Try to shutdown socket in some cases to prevent endless hang~~ Shutdown socket in some cases to prevent endless hang on blocking operation Feb 26, 2018

davidsh reviewed Mar 14, 2018

karelz added this to the 2.2.0 milestone Apr 4, 2018

Tomas Weinfurt and others added 5 commits May 25, 2018 12:30

try to shutdown socket in some cases to prevent endless hung

6af5cff

fix windows build and add unit test for #22564 scenario.

cf5a230

update tests and take some feedback from reviews

8105b12

update style

a57f07f

add comment to empty UnblockSocket() on Windows

72623a6

wfurt force-pushed the blocking_socket branch from fc1a30c to 72623a6 Compare May 25, 2018 19:31

Tomas Weinfurt and others added 2 commits May 29, 2018 17:11

correct rebase attempt

07b0b63

Merge branch 'master' of https://github.com/dotnet/corefx into blocki…

4f6a8ab

…ng_socket

update tests to be more reliable

690e158

krwq reviewed Aug 4, 2018

mc-denisov mentioned this pull request Aug 12, 2018

Sshclient deadlock/freeze on disconnect sshnet/SSH.NET#355

Closed

geoffkizer reviewed Aug 16, 2018

feedback from review

21ddfba

karelz closed this Aug 30, 2018

wfurt mentioned this pull request Sep 4, 2018

Shutdown socket in some cases to prevent endless hang on blocking operation #32087

Closed
