Fix callbacks race in SelectorLoop.sock_connect. by 1st1 · Pull Request #366 · python/asyncio

1st1 · 2016-06-28T19:25:30Z

While testing uvloop on recent CPython 3.5.2 I found a regression in loop.sock_connect, introduced in ed17848.

The bug breaks loop.sock_* in a very serious way, making programs that use those methods prone to random hangs after the socket is connected.

How to trigger

Let's imagine we have a server that sends some data (let's say b'hello') to the client immediately after connect, and the client program is the following:

    data = await self.recv_all(sock, 5)
    assert data == b'hello'
    await self.loop.sock_sendall(sock, PAYLOAD)

If the PAYLOAD is big enough, the client program will hang forever.
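For anyone who wants to try this locally, here is a minimal self-contained sketch of the scenario. The `recv_all` helper, the `handle` server callback, and the 1 MiB `PAYLOAD` size are assumptions made for illustration; they are not taken from the PR's test suite. On a fixed asyncio it should complete, while on an affected build the client would be expected to hit the `wait_for` timeout instead of hanging forever.

```python
import asyncio
import socket

# Assumed size: large enough that one send() cannot flush it all at once.
PAYLOAD = b"x" * (1024 * 1024)


async def recv_all(loop, sock, nbytes):
    """Hypothetical helper: read exactly nbytes with loop.sock_recv."""
    buf = b""
    while len(buf) < nbytes:
        chunk = await loop.sock_recv(sock, nbytes - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before sending all data")
        buf += chunk
    return buf


async def handle(reader, writer):
    # Server side: greet immediately after accept, then drain the payload.
    writer.write(b"hello")
    await writer.drain()
    received = 0
    while received < len(PAYLOAD):
        chunk = await reader.read(65536)
        if not chunk:
            break
        received += len(chunk)
    writer.close()


async def client(addr):
    loop = asyncio.get_running_loop()
    sock = socket.socket()
    sock.setblocking(False)
    try:
        await loop.sock_connect(sock, addr)
        data = await recv_all(loop, sock, 5)
        assert data == b"hello"
        # With the race present, this call is where the client would hang.
        await loop.sock_sendall(sock, PAYLOAD)
    finally:
        sock.close()
    return data


async def main():
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    addr = server.sockets[0].getsockname()
    try:
        # The timeout turns a hang into a visible failure.
        return await asyncio.wait_for(client(addr), timeout=10)
    finally:
        server.close()
```

Running `asyncio.run(main())` exercises the connect, receive, and send sequence once over loopback.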

Explanation

The cause of the hang is a race between callbacks -- one related to loop.sock_connect and one to sock_sendall.

Here's the relevant piece of code from selector_events.py:

    def sock_connect(self, sock, address):
        """Connect to a remote socket at address.

        This method is a coroutine.
        """
        if self._debug and sock.gettimeout() != 0:
            raise ValueError("the socket must be non-blocking")

        fut = self.create_future()
        if hasattr(socket, 'AF_UNIX') and sock.family == socket.AF_UNIX:
            self._sock_connect(fut, sock, address)
        else:
            resolved = base_events._ensure_resolved(
                address, family=sock.family, proto=sock.proto, loop=self)
            resolved.add_done_callback(
                lambda resolved: self._on_resolved(fut, sock, resolved))

        return fut

    def _on_resolved(self, fut, sock, resolved):
        try:
            _, _, _, _, address = resolved.result()[0]
        except Exception as exc:
            fut.set_exception(exc)
        else:
            self._sock_connect(fut, sock, address)

    def _sock_connect(self, fut, sock, address):
        fd = sock.fileno()
        try:
            sock.connect(address)
        except (BlockingIOError, InterruptedError):
            # Issue #23618: When the C function connect() fails with EINTR, the
            # connection runs in background. We have to wait until the socket
            # becomes writable to be notified when the connection succeed or
            # fails.
            fut.add_done_callback(functools.partial(self._sock_connect_done,
                                                    fd))
            self.add_writer(fd, self._sock_connect_cb, fut, sock, address)
        except Exception as exc:
            fut.set_exception(exc)
        else:
            fut.set_result(None)

    def _sock_connect_done(self, fd, fut):
        self.remove_writer(fd)

    def _sock_connect_cb(self, fut, sock, address):
        if fut.cancelled():
            return

        try:
            err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
            if err != 0:
                # Jump to any except clause below.
                raise OSError(err, 'Connect call failed %s' % (address,))
        except (BlockingIOError, InterruptedError):
            # socket is still registered, the callback will be retried later
            pass
        except Exception as exc:
            fut.set_exception(exc)
        else:
            fut.set_result(None)

Before ed17848, sock_connect called _sock_connect directly:

sock_connect created a fut Future.
If the address wasn't already resolved it raised an error.
If the address was resolved, it called _sock_connect, which attached a callback to the fut -- _sock_connect_done.
sock_connect then returned fut to the caller.
If the caller is a coroutine, it's wrapped in asyncio.Task. Therefore, fut now has two callbacks attached to it: [_sock_connect_done, Task._wakeup]

After that commit:

sock_connect creates a fut Future.
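It then calls _ensure_resolved and links fut to the Future that call returns, so _sock_connect (and its _sock_connect_done callback) only runs later, once the address is resolved.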
sock_connect returns fut to the caller.
If the caller is a coroutine, its Task will add a callback to the fut, eventually resulting in this: [Task._wakeup, _sock_connect_done]

Therefore, after ed17848, _sock_connect_done can be called after the await loop.sock_connect() line has completed. If the program calls loop.sock_sendall after sock_connect, _sock_connect_done will remove the writer callback that sock_sendall set up.
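The ordering problem can be shown without any sockets. The sketch below is a simplified model, not the asyncio implementation: `cleanup` stands in for _sock_connect_done, `wakeup` stands in for Task._wakeup resuming a coroutine that then calls sock_sendall, and the `writers` dict and `FD` value are hypothetical stand-ins for the selector's writer table. It relies only on the fact that a Future runs its done callbacks in the order they were added.

```python
import asyncio

FD = 7  # hypothetical file descriptor number


def run_demo(attach_cleanup_first):
    """Return (callback order, whether the sendall writer survived)."""
    events = []
    writers = {}  # stands in for the selector's fd -> writer callback table

    def cleanup(_fut):
        # Plays the role of _sock_connect_done: remove_writer(fd).
        events.append("cleanup")
        writers.pop(FD, None)

    def wakeup(_fut):
        # Plays the role of Task._wakeup followed by sock_sendall,
        # which registers its own writer on the same fd.
        events.append("wakeup")
        writers[FD] = "sendall writer"

    async def main():
        fut = asyncio.get_running_loop().create_future()
        if attach_cleanup_first:
            # Order before ed17848: [_sock_connect_done, Task._wakeup]
            fut.add_done_callback(cleanup)
            fut.add_done_callback(wakeup)
        else:
            # Order after ed17848: [Task._wakeup, _sock_connect_done]
            fut.add_done_callback(wakeup)
            fut.add_done_callback(cleanup)
        fut.set_result(None)
        await asyncio.sleep(0.01)  # give the scheduled callbacks time to run

    asyncio.run(main())
    return events, FD in writers
```

With `attach_cleanup_first=True` the writer registered by the stand-in for sock_sendall is still present afterwards; with `False` the late cleanup removes it, which is the condition that leaves sock_sendall waiting forever.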

/cc @gvanrossum @ajdavis @Haypo

1st1 · 2016-06-28T20:51:14Z

Another change: I had to make sock_connect an actual coroutine. Before this PR it returned a Future (although it was documented that the result of the call is a coroutine).

1st1 · 2016-06-28T21:11:42Z

The more I'm trying to fix this thing, the more tests break in interesting ways. Maybe we should just partially revert ed17848.

gvanrossum · 2016-06-28T21:17:41Z

Call in the author of that commit?

1st1 · 2016-06-28T21:20:26Z

Yes, I put Jesse in cc. I was the one who reviewed it though, so the responsibility is on me.

gvanrossum · 2016-06-28T21:22:12Z

Don't beat yourself up. Bugs happen. Life goes on.

1st1 · 2016-06-28T22:03:35Z

Alright, the tests are passing; please review.

ajdavis · 2016-06-28T22:46:57Z

Thanks for continued guidance and effort on this, @1st1. I'll review it soon, if you like.

I keep thinking, though, maybe we should revert this whole idea: the idea of skipping getaddrinfo if we can detect that the address is already resolved? It seems to be a bug factory. First, because getaddrinfo's AI_NUMERICHOST is harder to simulate in Python, on all platforms, than we thought. Second, because my attempt to fall back to getaddrinfo in ed17848 introduced an additional yield, which caused the current race condition.

Now that we've made getaddrinfo concurrent on Mac and BSD, getaddrinfo with AI_NUMERICHOST is no longer such a bottleneck. Would rolling back this whole line of changes be simpler and safer than continuing to whack bugs?
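For context on the two approaches being compared, here is a rough sketch using only the standard `socket` module. The function names are made up for illustration and this is not the code asyncio uses. The first function asks getaddrinfo itself, with AI_NUMERICHOST, whether the host is a literal address; the second tries to simulate that check with inet_pton, which is the kind of parsing the thread describes as hard to get right everywhere.

```python
import socket


def is_numeric_host(host):
    """Ask getaddrinfo whether host is a literal IP address.

    AI_NUMERICHOST makes getaddrinfo refuse to do a name lookup, so a
    hostname raises socket.gaierror instead of hitting the resolver.
    """
    try:
        socket.getaddrinfo(host, None, flags=socket.AI_NUMERICHOST)
    except socket.gaierror:
        return False
    return True


def looks_like_ipv4(host):
    """Simulate the same check for IPv4 only, using inet_pton."""
    try:
        socket.inet_pton(socket.AF_INET, host)
    except OSError:
        return False
    return True
```

The inet_pton version here covers IPv4 only; handling IPv6 literals, scope IDs, and platform differences is where a hand-written simulation tends to diverge from what getaddrinfo accepts.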

1st1 · 2016-06-28T22:52:35Z

> Would rolling back this whole line of changes be simpler and safer than continuing to whack bugs?

Probably yes. On the other hand, I like how sock_connect works now, accepting any kind of address. Please take a look at my patch.

gvanrossum · 2016-06-28T23:07:54Z

I like Jesse's idea. It's good to be thinking about the maintainability of this code.

…

--Guido (mobile)

ajdavis · 2016-07-03T23:29:51Z

This LGTM, I think. It's getting pretty complicated and my confidence is shaken that either of us can detect bugs by inspection. I propose we merge this--it's valuable both to fix the bug for now, and to add tests to guard against regression.

Then, I think (and Guido agrees) that we should revert this whole idea. So let's keep the new tests, as much as possible, but stop trying to simulate getaddrinfo's AI_NUMERICHOST.

I regret this idea was released in Python 3.5.2. I'm glad it's still "provisional". =)

1st1 · 2016-07-04T20:50:28Z

> Then, I think (and Guido agrees) that we should revert this whole idea. So let's keep the new tests, as much as possible, but stop trying to simulate getaddrinfo's AI_NUMERICHOST.

I'm -1 on reverting anything TBH.

Trying to parse with pton in create_connection and create_server is harmless, at least with the current implementation (that ships with 3.5.2). That's actually considered good practice.
For sock_connect, parsing vs. not parsing is not an issue: now we just call loop.getaddrinfo. It's actually a different question: do we want sock_connect to require resolved addresses, or can we make it resolve them? I think the latter is much preferable.

To conclude - I think we should just leave things as is (after this PR is merged).

1st1 · 2016-09-12T01:30:39Z

This fix will go in b2.

1st1 · 2016-09-15T21:35:56Z

Merged in d6dcf25.

vstinner · 2016-09-22T07:50:00Z

The test fails at least on one buildbot, "AMD64 FreeBSD CURRENT Non-Debug 3.x", please have a look:
http://bugs.python.org/issue28176#msg276736

berkerpeksag · 2016-09-30T21:34:10Z

Note that we've removed test_sock_connect_sock_write_race in http://bugs.python.org/issue28283, but it was added back in the last sync. Perhaps we should add a comment?

1st1 · 2016-10-05T21:04:05Z

The test has been removed from CPython and this repo. Closing this PR.

1st1 added 2 commits June 28, 2016 15:02

Fix callbacks race in SelectorLoop.sock_connect.

dc271d6

Fix failing unittests; make sock_connect a coroutine.

777d267

An attempt to fix CI

00f624a

warsaw mentioned this pull request Jul 14, 2016

Enable Travis-CI, disable Gitlab-CI aio-libs/aiosmtpd#3

Closed

1st1 closed this Sep 15, 2016

vstinner reopened this Sep 22, 2016

1st1 closed this Oct 5, 2016
