Skip to content

Add UCX Comm#2591

Merged
mrocklin merged 85 commits intodask:masterfrom
quasiben:tom-ucx
May 31, 2019
Merged

Add UCX Comm#2591
mrocklin merged 85 commits intodask:masterfrom
quasiben:tom-ucx

Conversation

@quasiben
Copy link
Copy Markdown
Member

@quasiben quasiben commented Apr 2, 2019

This PR brings in much of the work done by @TomAugspurger with ucx/ucx-py with @mrocklin's help I added a small change to cudf protocol handling and general cleanup.

@mrocklin
Copy link
Copy Markdown
Member

I think #2565 (comment) has the motivation, but I don’t recall details. Not sure about tests.

OK, for now I've turned off compression on all cuda data. That should stop us from splitting up large frames. It looks like currently there is a limit in ucx-py that keeps us under 2**31 bytes. At first this seems to be limited by the int type of the Message._length attribute. Changing that to long causes a segfault, so I'm probably missing something upstream in UCX.

I've also added logic to not call ensure_bytes and b''.join if there is only one element in that list.

@mrocklin
Copy link
Copy Markdown
Member

This now works-ish (at least when combined with some of the work in rapidsai/dask-cuda#46). I would like to get this to a point where we could merge it somewhat quickly.

I don't mind things being a little rough if they are well isolated into files that aren't in the mainline code path (files like ucx.py, cuda.py, cudf.py and so on).

There are a few TODO's left in the actual logic that I suspect are left by @TomAugspurger . Is this something that you can look into this week Tom to see if they are still necessary?

Copy link
Copy Markdown
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just gave a quick look through. I'm not sure if I'll have time this week to actually verify the TODOs yet; need to check on where we are for the next pandas release first.

General question: do we want users to provide the prefix ucx:// or ucp://?

# Workaround for hanging test in
# pytest distributed/comm/tests/test_ucx.py::test_comm_objs -vs --count=2
# on the second time through.
ucp._libs.ucp_py.reader_added = 0
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Akshay-Venkatesh do you recall if this was resolved? Is it the same as https://github.com/Akshay-Venkatesh/ucx-py/issues/69, or different?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger I tested this yesterday and the issue hasn't been resolved. This still has to be fixed.

@mrocklin
Copy link
Copy Markdown
Member

Thanks for the feedback @TomAugspurger . I think I can handle everything, I mostly wanted to get your thoughts on some of the comments. Thanks!

@mrocklin mrocklin changed the title [WIP] UCX work UCX work May 30, 2019
@mrocklin mrocklin changed the title UCX work Add UCX Comm May 30, 2019
@mrocklin
Copy link
Copy Markdown
Member

I've removed the WIP label. Review appreciated. I think that this is safe to go in.

@mrocklin
Copy link
Copy Markdown
Member

OK, I'm merging this tomorrow if there are no further comments.

@TomAugspurger
Copy link
Copy Markdown
Member

TomAugspurger commented May 31, 2019 via email

@mrocklin mrocklin merged commit a8504d6 into dask:master May 31, 2019
@TomAugspurger
Copy link
Copy Markdown
Member

🎉

@mrocklin
Copy link
Copy Markdown
Member

Thank you @TomAugspurger, @quasiben, and @Akshay-Venkatesh for working on this. I'm sure that there is still plenty more to do here, but it will be nice to have this in master.

muammar added a commit to muammar/distributed that referenced this pull request Jun 12, 2019
* upstream/master: (58 commits)
  Add unknown pytest markers (dask#2764)
  Delay lookup of allowed failures. (dask#2761)
  Change address -> worker in ColumnDataSource for nbytes plot (dask#2755)
  Remove module state in Prometheus Handlers (dask#2760)
  Add stress test for UCX (dask#2759)
  Add nanny logs (dask#2744)
  Move some of the adaptive logic into the scheduler (dask#2735)
  Add SpecCluster.new_worker_spec method (dask#2751)
  Worker dashboard fixes (dask#2747)
  Add async context managers to scheduler/worker classes (dask#2745)
  Fix the resource key representation before sending graphs (dask#2716) (dask#2733)
  Allow user to configure whether workers are daemon. (dask#2739)
  Pin pytest >=4 with pip in appveyor and python 3.5 (dask#2737)
  Add Experimental UCX Comm (dask#2591)
  Close nannies gracefully (dask#2731)
  add kwargs to progressbars (dask#2638)
  Add back LocalCluster.__repr__. (dask#2732)
  Move bokeh module to dashboard (dask#2724)
  Close clusters at exit (dask#2730)
  Add SchedulerPlugin TaskState example (dask#2622)
  ...
muammar added a commit to muammar/distributed that referenced this pull request Jul 18, 2019
* upstream/master: (43 commits)
  Add unknown pytest markers (dask#2764)
  Delay lookup of allowed failures. (dask#2761)
  Change address -> worker in ColumnDataSource for nbytes plot (dask#2755)
  Remove module state in Prometheus Handlers (dask#2760)
  Add stress test for UCX (dask#2759)
  Add nanny logs (dask#2744)
  Move some of the adaptive logic into the scheduler (dask#2735)
  Add SpecCluster.new_worker_spec method (dask#2751)
  Worker dashboard fixes (dask#2747)
  Add async context managers to scheduler/worker classes (dask#2745)
  Fix the resource key representation before sending graphs (dask#2716) (dask#2733)
  Allow user to configure whether workers are daemon. (dask#2739)
  Pin pytest >=4 with pip in appveyor and python 3.5 (dask#2737)
  Add Experimental UCX Comm (dask#2591)
  Close nannies gracefully (dask#2731)
  add kwargs to progressbars (dask#2638)
  Add back LocalCluster.__repr__. (dask#2732)
  Move bokeh module to dashboard (dask#2724)
  Close clusters at exit (dask#2730)
  Add SchedulerPlugin TaskState example (dask#2622)
  ...
Comment on lines +179 to 181
header = bytes(header)
if header:
header = msgpack.loads(header, use_list=False, **msgpack_opts)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you recall why this was needed? Was this due to the if header line? Did msgpack.dumps need this? Or was it due to something else like potentially unusual types being passed in for header (like maybe a NumPy array)?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I guess this is explained here ( 44c1d5c ).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants