Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Shutdown socket in some cases to prevent endless hang on blocking operation#26898

Closed
wfurt wants to merge 9 commits intodotnet:masterfrom
wfurt:blocking_socket
Closed

Shutdown socket in some cases to prevent endless hang on blocking operation#26898
wfurt wants to merge 9 commits intodotnet:masterfrom
wfurt:blocking_socket

Conversation

@wfurt
Copy link
Member

@wfurt wfurt commented Feb 6, 2018

Fixes #22564
Fixes #26034

Use Shutdown() when closing on blocking sockets to prevent infinite blockage.

@wfurt
Copy link
Member Author

wfurt commented Feb 6, 2018

@dotnet-bot test Outerloop Linux x64 Debug Build
@dotnet-bot test Outerloop Windows x64 Debug Build
@dotnet-bot test Outerloop OSX x64 Debug Build
@dotnet-bot Test Outerloop UWP CoreCLR x64 Debug Build
@dotnet-bot Test Outerloop NETFX x86 Release Build

@ianhays
Copy link
Contributor

ianhays commented Feb 6, 2018

@dotnet-bot test innerloop Fedora24 debug

@wfurt
Copy link
Member Author

wfurt commented Feb 6, 2018

@dotnet-bot test Outerloop Linux x64 Debug Build
@dotnet-bot test Outerloop Windows x64 Debug Build
@dotnet-bot test Outerloop OSX x64 Debug Build
@dotnet-bot Test Outerloop UWP CoreCLR x64 Debug Build
@dotnet-bot Test Outerloop NETFX x86 Release Build
@dotnet-bot test innerloop Fedora24 debug

@davidsh davidsh changed the title [WIP] try to shutdown socket in some cases to prevent endless hung [WIP] Try to shutdown socket in some cases to prevent endless hang Feb 6, 2018
}
}

private void TryWakeup(InnerSafeCloseSocket innerSocket)
Copy link
Contributor

@davidsh davidsh Feb 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would rename this to remove the 'Try' prefix since methods with that prefix have an expected pattern that doesn't match this. Comment applies to other files as well.

}
}

public bool GetNonBlocking() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: bracing style doesn't match other methods in file. should use Allman-style bracing.


private void TryWakeup(InnerSafeCloseSocket innerSocket)
{
if ((AsyncContext == null || !AsyncContext.GetNonBlocking()) && innerSocket != null && !_underlyingHandleNonBlocking && !innerSocket.IsClosed && !innerSocket.IsInvalid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fairly complicated condition and it's not clear to me exactly what it's validating. Can you add a comment?

Also, are both the IsClosed and IsInvalid checks necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll check if I can simplify this. What I want is to find out if we have valid OS blocking socket. In all other cases I want to skip the shutdown and use existing behavior. There are existing test verifying we throw if dispose is called in middle of pending async operation. I did not want to change that behavior.

{
if ((AsyncContext == null || !AsyncContext.GetNonBlocking()) && innerSocket != null && !_underlyingHandleNonBlocking && !innerSocket.IsClosed && !innerSocket.IsInvalid)
{
Interop.Sys.Shutdown(innerSocket, SocketShutdown.Receive);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just Receive? Will this fix blocked sends/connects, or do those require shutting down the Send direction as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about it @stephentoub . But I did not want to interfere with writing logic. At least on Unix, close() would always finish writing data. Also the write would finish or fail even if that may take some some. I'm not sure if there is case when it would hang forever. I can revisit that if you think we should.


private void TryWakeup(InnerSafeCloseSocket innerSocket)
{
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this isn't necessary; you can just remove the "return;" and leave the method empty. You might also add a comment indicating that it's empty on purpose, and explaining why

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may need to revisit this. comments from the bugs suggest that Windows may have problems in single core configuration. I plan to test it but I was using this PR to get wide test coverage from CI run.

Dispose();
if (innerSocket != null)
{
TryWakeup(innerSocket);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the "Wait until it's safe" below referring to? I'm wondering if it's ok to do this here (or even required to do this here), or if it should move down to before the InnerReleaseHandle.

// Blocking read.
try
{
client.Receive(buffer);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be good to have tests for the other main blocking operations, e.g. Send, Connect, Accept, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I was planning to. There is separate bug for accept. The problem is that the shutdown does not work for that on OSX (works ok on Linux) I'm still investigating.

catch (ObjectDisposedException)
{
// Closed and disposed by parent thread.
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't still want to call client.Close in this case?

clientThread.Start();
Socket s = server.Accept();
// Close client socket from parent thread.
client.Close();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the client.Close duplicated here as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is way how to force close from other thread. The clientThread should be blocked in receive() . This mimics example from #26034

@wfurt
Copy link
Member Author

wfurt commented Feb 8, 2018

I updated unit tests to cover both known issues. Note, that the shutdown() trick does not work on OSX for accept() and it fails with ENOTCONN. It works just fine on Linux.
I modified the tests to be executed in separate process to avoid any possibly whole suite will hangs if anything goes wrong. I verified that the test fill fail with some details when the actual test function blocks.

Also note I did not find reliable way how to detect that child thread is blocked. The documentation says Running = "The thread has been started, it is not blocked" but that does not seems to count cases when we are blocked on IO. I added Send/Receive sequence and that seems to be reasonably reliable in my tests (Linux & OSX) If timing does not work right, the thread will get exception that object was disposed. That could possibly return TestSkip one day.

@wfurt
Copy link
Member Author

wfurt commented Feb 8, 2018

@dotnet-bot test Outerloop Linux x64 Debug Build
@dotnet-bot test Outerloop Windows x64 Debug Build
@dotnet-bot test Outerloop OSX x64 Debug Build
@dotnet-bot Test Outerloop UWP CoreCLR x64 Debug Build
@dotnet-bot Test Outerloop NETFX x86 Release Build

@wfurt wfurt changed the title [WIP] Try to shutdown socket in some cases to prevent endless hang Shutdown socket in some cases to prevent endless hang on blocking operation Feb 26, 2018
@wfurt
Copy link
Member Author

wfurt commented Feb 26, 2018

@dotnet-bot test Outerloop Linux x64 Debug Build
@dotnet-bot test Outerloop Windows x64 Debug Build
@dotnet-bot test Outerloop OSX x64 Debug Build
@dotnet-bot Test Outerloop UWP x64 Debug Build
@dotnet-bot Test Outerloop NETFX x86 Release Build

@tmds
Copy link
Member

tmds commented Mar 1, 2018

@wfurt on Linux, you can use connect to AF_UNSPEC to disconnect a TCP socket. This will make blocking receive and send calls return. Also epoll returns an event for the fd. This may work similar on OSX/BSD. I think its worth exploring as an alternative to the Shutdown call. Can you give that a try?

@tmds
Copy link
Member

tmds commented Mar 1, 2018

on Linux, you can use connect to AF_UNSPEC to disconnect a TCP socket.

I compared this with shutdown. The connect to AF_UNSPEC causes a connection reset to be sent to the peer.
So shutdown is better.

@wfurt
Copy link
Member Author

wfurt commented Mar 1, 2018

Thanks for the note @tmds. The shutdown should also flush any pending data from socket buffers. I'll experiment with AF_UNSPEC on OSX. I still did not figure out how to wake up blocking accept there.

@karelz
Copy link
Member

karelz commented Mar 12, 2018

@wfurt what is left to do here? Any ETA?

@wfurt
Copy link
Member Author

wfurt commented Mar 14, 2018

I'm not sure @karelz. Perkaps @stephentoub and @geoffkizer can take another look.
To me the biggest mystery is question @stephentoub asked: "What is the "Wait until it's safe" below referring to" That is exiting code and I was not comfortable changing it for this issue.
Id be happy to open new one as reminder to do it in future.

@karelz
Copy link
Member

karelz commented Mar 14, 2018

Why are the tests failing? Once the test runs are clean, please work with @stephentoub to clarify his comment & to get final code review.

}
}

private void UnblockSocket(InnerSafeCloseSocket innerSocket) {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comments to this to explain why this method is a no-op on Windows.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you see harm of calling shutdown() on Windows even if it is not needed @davidsh -> just for consistency reason? For now, I was trying to keep the change minimal as possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you see harm of calling shutdown() on Windows even if it is not needed

We already have an explicit API that developers can call to do Shutdown(). I don't think it is a good idea to do this automatically.

I also don't understand why these changes are needed in this PR. It seems like it isn't clear as to the root cause.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference is that on Unix calling read() or accept() will block calling thread until operation fails or succeeds -> possibly indefinitely. Unlike Async or non-blocking, there is no timeout associated with that call. That part seems to be working different way on Windows where we are really never blocked e.g. when there is request for cancelation we can stop pending operation from our socket code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if this is a platform difference in behavior (Windows vs. Linux), then you shouldn't add any code for Windows. But you should still add comments to this no-op method to explain. And the words above are probably what you need in your comment for source code.

@wfurt
Copy link
Member Author

wfurt commented Mar 14, 2018

I don't know why tests are failing @karelz. I get 404 for all the links. I can try to kick the tests again.

@wfurt
Copy link
Member Author

wfurt commented Mar 14, 2018

@dotnet-bot test Outerloop Linux x64 Debug Build
@dotnet-bot test Outerloop Windows x64 Debug Build
@dotnet-bot test Outerloop OSX x64 Debug Build

@wfurt
Copy link
Member Author

wfurt commented Mar 14, 2018

most test failed on authentication failures or for unknown reasons. Valid runs have failures unrelated to this pr. (like crypto)

@wfurt
Copy link
Member Author

wfurt commented Mar 20, 2018

@dotnet-bot test Outerloop Linux x64 Debug Build
@dotnet-bot test Outerloop Windows x64 Debug Build
@dotnet-bot test Outerloop OSX x64 Debug Build

@wfurt
Copy link
Member Author

wfurt commented Mar 25, 2018

@dotnet-bot test Linux x64 Release Build

@karelz karelz added this to the 2.2.0 milestone Apr 4, 2018
@stephentoub
Copy link
Member

@wfurt, are we still pursuing this?

@wfurt
Copy link
Member Author

wfurt commented May 24, 2018

Yes, I would like to. I need to investigate the CI failures.

@wfurt wfurt force-pushed the blocking_socket branch from fc1a30c to 72623a6 Compare May 25, 2018 19:31
@wfurt
Copy link
Member Author

wfurt commented May 25, 2018

@dotnet-bot test Outerloop Linux x64 Debug Build
@dotnet-bot test Outerloop Windows x64 Debug Build
@dotnet-bot test Outerloop OSX x64 Debug Build

@wfurt
Copy link
Member Author

wfurt commented May 25, 2018

rebased with current master

@karelz
Copy link
Member

karelz commented Jul 19, 2018

@wfurt this is ancient PR. If it is not ready for merge, let's close it and reopen once you have time to finish it please.

@wfurt
Copy link
Member Author

wfurt commented Jul 25, 2018

@dotnet-bot test Outerloop Linux x64 Debug Build
@dotnet-bot test Outerloop Windows x64 Debug Build
@dotnet-bot test Outerloop OSX x64 Debug Build
@dotnet-bot Test Outerloop UWP CoreCLR x64 Debug Build
@dotnet-bot Test Outerloop NETFX x86 Release Build

@wfurt
Copy link
Member Author

wfurt commented Jul 25, 2018

I updated the tests and I did several hundreds runs with full parallel suite.

@danmoseley
Copy link
Member

@wfurt looks like hte only failures passed on rerun.

{
// This test verifies blocking behavior. Always run it as remote task
// to isolate any possible failures.
RemoteInvoke((address) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why isn't thread enough isolation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guess: failure could manifest as exception on threadpool thread?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't Task.Run going to take care of all of the exceptions softly? Could you give example (or simulation) of failure which we are afraid of?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to be sure that if something goes wrong we do not block whole xunit suite @krwq.
RemoteProcess seems more resilient. (even if maybe harder to debug)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @krwq here. If the test is failing, it doesn't really matter whether it is blocking the rest of xunit or not. Adding RemoteInvoke just seems to make the test more complicated.

@wfurt
Copy link
Member Author

wfurt commented Aug 6, 2018

@dotnet-bot test Outerloop OSX x64 Debug Build

@wfurt
Copy link
Member Author

wfurt commented Aug 8, 2018

all the OSX failures are in test going to external echo server and seems unrelated to this PR.

@karelz
Copy link
Member

karelz commented Aug 15, 2018

@wfurt please merge it today or close it.


private void UnblockSocket(InnerSafeCloseSocket innerSocket)
{
if ((AsyncContext == null || !AsyncContext.GetNonBlocking()) && innerSocket != null && !_underlyingHandleNonBlocking && !innerSocket.IsClosed && !innerSocket.IsInvalid)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are doing all these tests to try to minimize the impact of this change. I.e. we want to only do the shutdown call in the case where we know it's hanging today. Correct?

I think it's fine to be cautious here, but we should add a comment explaining that these checks are basically just to be cautious (unless that's not true for some of them). If/when we revisit the socket close logic (and I hope we will do this at some point in 3.0), we will want to revisit this, and a comment will help explain the rationale behind this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. I can add comment with explanation. For Async/non-blocking we have no problem being stuck in system call.

byte[] buffer = new byte[1];
buffer[0] = Convert.ToByte('a');

Thread clientThread = new Thread(() =>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Task.Run instead of creating a new thread

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue below

@karelz
Copy link
Member

karelz commented Aug 23, 2018

@wfurt how far are we from merging? As I said earlier, if we can't finish it in couple of days, please close it for now.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants