Add UCX Comm by quasiben · Pull Request #2591 · dask/distributed

quasiben · 2019-04-02T03:25:02Z

This PR brings in much of the work done by @TomAugspurger with ucx/ucx-py. With @mrocklin's help, I added a small change to cudf protocol handling and did some general cleanup.

distributed/protocol/utils.py

distributed/protocol/cupy.py

distributed/protocol/cudf.py

distributed/deploy/tests/test_local.py

distributed/comm/ucx.py

mrocklin · 2019-05-27T21:56:15Z

I think #2565 (comment) has the motivation, but I don’t recall details. Not sure about tests.

OK, for now I've turned off compression on all CUDA data. That should stop us from splitting up large frames. It looks like there is currently a limit in ucx-py that keeps us under 2**31 bytes. At first glance this seems to be limited by the int type of the Message._length attribute. Changing that to long causes a segfault, so I'm probably missing something upstream in UCX.
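To illustrate the 2**31 limit mentioned above, here is a minimal sketch of why a length stored in a signed 32-bit C int cannot represent larger frames. It assumes `Message._length` is such an int, as the comment suggests; the helper name `fits_in_c_int` is hypothetical and not part of ucx-py.

```python
import ctypes

# Largest value a signed 32-bit C int can hold.
INT32_MAX = 2**31 - 1


def fits_in_c_int(nbytes: int) -> bool:
    """Return True if a frame length can be stored in a signed 32-bit int."""
    return 0 <= nbytes <= INT32_MAX


# ctypes does no overflow checking, so an oversized length silently wraps
# around to a negative number instead of raising an error.
wrapped = ctypes.c_int32(2**31).value
```

A frame of exactly 2**31 bytes already fails the check, and the wrapped value is negative, which is the kind of corrupted length that would have to be avoided by keeping frames under the limit.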

I've also added logic to not call ensure_bytes and b''.join if there is only one element in that list.
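The single-element shortcut described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `join_frames_sketch` is a hypothetical name, and the built-in `bytes()` stands in for `ensure_bytes`.

```python
def join_frames_sketch(frames):
    """Combine a list of frames into one payload.

    With a single frame, return it untouched so no copy is made.
    With several frames, coerce each to bytes and concatenate them.
    """
    if len(frames) == 1:
        # Nothing to join: skip the bytes conversion and the copy.
        return frames[0]
    # bytes() is a stand-in for ensure_bytes (an assumption of this sketch).
    return b"".join(bytes(frame) for frame in frames)
```

The point of the shortcut is that a lone frame, for example a large device or memoryview-backed buffer, is passed through as the same object rather than being copied into a new bytes object.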

distributed/core.py

mrocklin · 2019-05-27T22:12:08Z

This now works-ish (at least when combined with some of the work in rapidsai/dask-cuda#46). I would like to get this to a point where we could merge it somewhat quickly.

I don't mind things being a little rough if they are well isolated into files that aren't in the mainline code path (files like ucx.py, cuda.py, cudf.py and so on).

There are a few TODOs left in the actual logic that I suspect were left by @TomAugspurger. Tom, is this something you can look into this week to see if they are still necessary?

TomAugspurger

Just gave a quick look through. I'm not sure if I'll have time this week to actually verify the TODOs yet; need to check on where we are for the next pandas release first.

General question: do we want users to provide the prefix ucx:// or ucp://?
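For context on the prefix question, a comm backend is picked from the scheme in front of the address. The sketch below shows that split in plain Python; `split_scheme` and the `"tcp"` default are assumptions made for illustration, not the parsing code in distributed.

```python
def split_scheme(address: str, default: str = "tcp"):
    """Split 'scheme://location' into (scheme, location).

    If no scheme is present, fall back to `default` (assumed here).
    """
    if "://" in address:
        scheme, _, location = address.partition("://")
        return scheme, location
    return default, address
```

With this shape, `ucx://` and `ucp://` would simply be two different registry keys, so the choice is mostly about which name users should see.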

distributed/comm/tests/test_ucx.py

TomAugspurger · 2019-05-28T13:24:04Z

distributed/comm/tests/test_ucx.py

+    # Workaround for hanging test in
+    # pytest distributed/comm/tests/test_ucx.py::test_comm_objs -vs --count=2
+    # on the second time through.
+    ucp._libs.ucp_py.reader_added = 0


@Akshay-Venkatesh do you recall if this was resolved? Is it the same as https://github.com/Akshay-Venkatesh/ucx-py/issues/69, or different?

@TomAugspurger I tested this yesterday and the issue hasn't been resolved. This still has to be fixed.

distributed/comm/tests/test_ucx.py

distributed/comm/ucx.py

distributed/core.py

mrocklin · 2019-05-28T16:25:13Z

Thanks for the feedback @TomAugspurger. I think I can handle everything; I mostly wanted to get your thoughts on some of the comments. Thanks!

mrocklin · 2019-05-30T14:25:18Z

I've removed the WIP label. Review appreciated. I think that this is safe to go in.

mrocklin · 2019-05-30T16:42:15Z

OK, I'm merging this tomorrow if there are no further comments.

TomAugspurger · 2019-05-31T08:28:25Z

I won't have time soon to look this over again, but based on my last review I also think this is safe to go in.

TomAugspurger · 2019-05-31T14:01:48Z

🎉

mrocklin · 2019-05-31T14:03:04Z

Thank you @TomAugspurger, @quasiben, and @Akshay-Venkatesh for working on this. I'm sure that there is still plenty more to do here, but it will be nice to have this in master.

* upstream/master: (58 commits)
  Add unknown pytest markers (dask#2764)
  Delay lookup of allowed failures. (dask#2761)
  Change address -> worker in ColumnDataSource for nbytes plot (dask#2755)
  Remove module state in Prometheus Handlers (dask#2760)
  Add stress test for UCX (dask#2759)
  Add nanny logs (dask#2744)
  Move some of the adaptive logic into the scheduler (dask#2735)
  Add SpecCluster.new_worker_spec method (dask#2751)
  Worker dashboard fixes (dask#2747)
  Add async context managers to scheduler/worker classes (dask#2745)
  Fix the resource key representation before sending graphs (dask#2716) (dask#2733)
  Allow user to configure whether workers are daemon. (dask#2739)
  Pin pytest >=4 with pip in appveyor and python 3.5 (dask#2737)
  Add Experimental UCX Comm (dask#2591)
  Close nannies gracefully (dask#2731)
  add kwargs to progressbars (dask#2638)
  Add back LocalCluster.__repr__. (dask#2732)
  Move bokeh module to dashboard (dask#2724)
  Close clusters at exit (dask#2730)
  Add SchedulerPlugin TaskState example (dask#2622)
  ...

jakirkham · 2021-01-07T18:57:40Z

distributed/protocol/core.py

+    header = bytes(header)
    if header:
        header = msgpack.loads(header, use_list=False, **msgpack_opts)


Do you recall why this was needed? Was this due to the if header line? Did msgpack.dumps need this? Or was it due to something else like potentially unusual types being passed in for header (like maybe a NumPy array)?

Ah, I guess this is explained here (44c1d5c).
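For readers following the question, here is a minimal sketch of the pattern in the diff: coerce the header to `bytes`, then decode it only if it is non-empty. The standard-library `json` module is used as a stand-in for msgpack so the snippet runs without extra dependencies. The function name and the empty-dict fallback are assumptions, and the thread itself does not confirm why the cast was needed.

```python
import json


def load_header_sketch(header):
    """Decode a header that may arrive as bytes, bytearray, or memoryview."""
    # Normalize any bytes-like object to plain bytes first, so the
    # emptiness check and the decoder both see a predictable type.
    header = bytes(header)
    if header:
        # json.loads stands in for msgpack.loads in this sketch.
        return json.loads(header)
    # Assumed fallback for an empty header.
    return {}
```

One consequence of casting first is that the `if header` check always operates on a plain `bytes` object, regardless of what buffer type the comm layer handed over.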

Tom Augspurger and others added 24 commits February 19, 2019 07:32

ENH: UCX-based Comms

a9c17f8

* Stubs for classes

CUDA failing

65bd972

fixups

4fc6acd

wip

d64c5cc

zero copy

b28668b

wip

4fcafae

BUG: Ensure proper cleanup in comm_pair tests

f33ba29

Reset reader_added before listening

b61e56d

CUDA failing

89df2bd

fixups

5b51716

wip

a71c896

zero copy

fb862bf

Merge branch 'ucx+data-handling' of https://github.com/TomAugspurger/…

6a194a3

…distributed into ucx+data-handling

fix registration

594028c

all passing

120bc2f

cleanup

2def137

move override to the test

fcb800a

rename

8745a19

remove old tests

1339a3e

todos

5d0d993

Send headers

8bc2dbb

use nbytes

3e998ce

let internal protocol machinery set lengths

d4b3501

clean up

b89717e