Fix callbacks race in SelectorLoop.sock_connect.#366
Fix callbacks race in SelectorLoop.sock_connect.#3661st1 wants to merge 3 commits intopython:masterfrom
Conversation
|
Another change: I had to make |
|
The more I'm trying to fix this thing, the more tests break in interesting ways. Maybe we should just partially revert ed17848. |
|
Call in the author of that commit?
|
|
Yes, I put Jesse in cc. I was the one who reviewed it though, so the responsibility is on me. |
|
Don't beat yourself up. Bugs happen. Life goes on.
|
|
Alright, the tests are passing; please review. |
|
Thanks for continued guidance and effort on this, @1st1. I'll review it soon, if you like. I keep thinking, though, maybe we should revert this whole idea: the idea of skipping getaddrinfo if we can detect that the address is already resolved? It seems to be a bug factory. First because getaddrinfo's AI_NUMERICHOST is harder to simulate in Python, in all platforms, than we thought. Second, because my attempt to fallback to getaddrinfo in ed17848 introduced an additional yield, which caused the current race condition. Now that we've made getaddrinfo concurrent on Mac and BSD, getaddrinfo with AI_NUMERICHOST is no longer such a bottleneck. Would rolling back this whole line of changes be simpler and safer than continuing to whack bugs? |
Probably yes. On the other had, I like how |
|
I like Jesse's idea. It's good to be thinking about the maintainability of
this code.
…--Guido (mobile)
|
|
This LGTM, I think. It's getting pretty complicated and my confidence is shaken that either of us can detect bugs by inspection. I propose we merge this--it's valuable both to fix the bug for now, and to add tests to guard against regression. Then, I think (and Guido agrees) that we should revert this whole idea. So let's keep the new tests, as much as possible, but stop trying to simulate getaddrinfo's AI_NUMERICHOST. I regret this idea was released in Python 3.5.2. I'm glad it's still "provisional". =) |
I'm -1 on reverting anything TBH.
To conclude - I think we should just leave things as is (after this PR is merged). |
|
This fix will go in b2. |
|
Merged in d6dcf25. |
|
The test fails at least on one buildbot, "AMD64 FreeBSD CURRENT Non-Debug 3.x", please have a look: |
|
Note that we've removed |
|
The test has been removed from CPython and this repo. Closing this PR. |
While testing uvloop on recent CPython 3.5.2 I found a regression in
loop.sock_connect, introduced in ed17848.The bug breaks
loop.sock_*in a very serious way, making programs that use those methods prone to random hangs after socket is connected.How to trigger
Let's imagine we have a server, that sends some data (let's say
b'hello') to the client immediately after connect. And the client program is the following:If the
PAYLOADis big enough, the client program will hang forever.Explanation
The cause of the hang is a race between callbacks -- one related to
loop.sock_connectand one tosock_sendall.Here's the relevant piece of code from
selector_events.py:Before ed17848,
sock_connectcalled_sock_connectdirectly:sock_connectcreated afutFuture._sock_connect, which attached a callback to thefut--_sock_connect_done.sock_connectthen returnedfutto the caller.asyncio.Task. Therefore,futnow have two callbacks attached to it:[_sock_connect_done, Task._wakeup]After that commit:
sock_connectcreates afutFuture._ensure_resolved(linkedfutto the result of that call's Future).sock_connectreturnsfutto the caller.Taskwill add a callback to thefut, eventually resulting in this:[Task._wakeup, _sock_connect_done]Therefore, after ed17848,
_sock_connect_donecan be called afterawait loop.sock_connect()line. If the program callsloop.sock_sendallaftersock_connect,_sock_connect_donewill remove writer callback thatsock_sendallset up./cc @gvanrossum @ajdavis @Haypo