Skip to content

skip getaddrinfo thread if host is already resolved, using socket.AI_NUMERIC*#6847

Merged
fjetter merged 3 commits intodask:mainfrom
graingert:skip-getaddrinfo-on-thread
Aug 10, 2022
Merged

skip getaddrinfo thread if host is already resolved, using socket.AI_NUMERIC*#6847
fjetter merged 3 commits intodask:mainfrom
graingert:skip-getaddrinfo-on-thread

Conversation

@graingert
Copy link
Member

should help with #6846

  • Tests added / passed
  • Passes pre-commit run --all-files

@graingert graingert force-pushed the skip-getaddrinfo-on-thread branch from 7c9b3e3 to 4865250 Compare August 8, 2022 10:30
Copy link
Member

@fjetter fjetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this helps, I'm +1. I think the complexity here is not too bad, even if the patch is not accepted in CPython.

Regarding the OSX slowness. Under which conditions do we experience any slowness? I'm worried about further slowing down our CI since OSX jobs are already the slowest.

Is this "slowness" significant? Would a user working on a mac book notice this when connecting to a cluster?

@github-actions
Copy link
Contributor

github-actions bot commented Aug 8, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±0         15 suites  ±0   6h 46m 1s ⏱️ + 11m 11s
  2 990 tests ±0    2 902 ✔️ +3       88 💤 ±0  0  - 3 
22 173 runs  ±0  21 129 ✔️ +4  1 044 💤  - 1  0  - 3 

Results for commit 09c75e9. ± Comparison against base commit 3f91ad3.

♻️ This comment has been updated with latest results.

@graingert
Copy link
Member Author

graingert commented Aug 8, 2022

Is this "slowness" significant? Would a user working on a mac book notice this when connecting to a cluster?

The numeric path should never hit the kernel or a lock, it's supposed to be implemented with just plain bytes parsing
so if it is working as intended, it is executing a couple lines of C code which is basically instantaneous
if it is indeed slow, i.e. a couple of milliseconds would already be very slow, we may need to do something about it

however our choice of what to do when this is slow is limited - we don't know if it's slow until we call it, and we can only know it was too slow after the damage has already been done and we wakeup
we could time the blocking call and throw an error to let the user know: raise RuntimeErrro("HEY this is a mythical bug if you can reproduce this go to https://github.com/dask/distributed/issues")

eg something like this:

def _getaddrinfo_debug(loop, host, port, family, type, proto, flags):
    # based on https://github.com/python/cpython/blob/63140b445e4a303df430b3d60c1cd4ef34f27c03/Lib/asyncio/base_events.py#L833-L856
    t0 = loop.time()
    addrinfo = socket.getaddrinfo(host, port, family, type, proto, flags)
    dt = loop.time() - t0

    if dt >= loop.slow_callback_duration:
        msg = [f"{host}:{port!r}"]
        if family:
            msg.append(f'family={family!r}')
        if type:
            msg.append(f'type={type!r}')
        if proto:
            msg.append(f'proto={proto!r}')
        if flags:
            msg.append(f'flags={flags!r}')
        msg = ', '.join(msg)
        raise RuntimeError(f'Getting address info {msg} took {dt * 1e3:.3f}ms: {addrinfo!r}')
    return addrinfo

@graingert graingert marked this pull request as ready for review August 8, 2022 14:09
@graingert
Copy link
Member Author

def _getaddrinfo_debug(loop, host, port, family, type, proto, flags):
    # based on https://github.com/python/cpython/blob/63140b445e4a303df430b3d60c1cd4ef34f27c03/Lib/asyncio/base_events.py#L833-L856
    t0 = loop.time()
    addrinfo = socket.getaddrinfo(host, port, family, type, proto, flags)
    dt = loop.time() - t0

    if dt >= loop.slow_callback_duration:
        msg = [f"{host}:{port!r}"]
        if family:
            msg.append(f'family={family!r}')
        if type:
            msg.append(f'type={type!r}')
        if proto:
            msg.append(f'proto={proto!r}')
        if flags:
            msg.append(f'flags={flags!r}')
        msg = ', '.join(msg)
        raise RuntimeError(f'Getting address info {msg} took {dt * 1e3:.3f}ms: {addrinfo!r}')
    return addrinfo

Copy link
Collaborator

@gjoseph92 gjoseph92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could time the blocking call and throw an error to let the user know

This is vaguely tempting to me mostly because it costs very little to check, but could be nice to have running in CI. If the error never shows up then it can make us more confident this was the right thing to do.

Copy link
Member

@fjetter fjetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this only affects OSX machines, if the problem even exists, I'm OK with this without any additional debugging logs.
The only use case for OSX are LocalClusters and Clients and I don't think slow DNS should be a topic for local clusters.
We do have the "tick warning" implemented and should already get warnings if the loop is blocked for a prolonged period of time on a cluster.
Clients would not be impacted by this since the event loop is most of the time idle anyhow on clients.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants