Add a hard-coded DGX configuration#46

Merged
mrocklin merged 18 commits into rapidsai:branch-0.7 from mrocklin:dgx-1
Jun 6, 2019

Conversation

@mrocklin
Contributor

@mrocklin mrocklin commented May 9, 2019

This currently depends on github.com/mrocklin/distributed@dev

@mrocklin
Contributor Author

mrocklin commented May 9, 2019

Currently benchmarking with

from dask_cuda import DGX
from dask.distributed import Client, wait
cluster = DGX()
client = Client(cluster)

import cupy, dask.array as da, dask
rs = da.random.RandomState(RandomState=cupy.random.RandomState)
x = rs.normal(10, 1, size=(50000, 50000)).persist()
y = (x + x.T).sum().compute()

Script

from dask_cuda import DGX
from dask.distributed import Client, wait
import cupy, dask.array as da, dask

with DGX() as cluster:
    with Client(cluster) as client:
        rs = da.random.RandomState(RandomState=cupy.random.RandomState)
        x = rs.normal(10, 1, size=(50000, 50000)).persist()
        wait(x)
        y = (x + x.T).sum().compute()

@mrocklin
Contributor Author

mrocklin commented May 9, 2019

@Akshay-Venkatesh @rjzamora if you all have a chance, here is a system that, I think, should set up a DGX properly. I'm still not able to use the UCX_NET_DEVICES environment variable effectively.

dask_cuda/dgx.py Outdated
scheduler = {
"cls": Scheduler,
"options": {
"interface": "ib0",
Member

It might make sense to use a different interface for the scheduler in the case that none of your workers are running on gpu-0 or gpu-1

@rjzamora
Member

rjzamora commented May 9, 2019

@mrocklin - I am not able to run the benchmark you shared. My env:

mrocklin/distributed@dev
mrocklin/dask-cuda@dgx-1
Akshay-Venkatesh/ucx@1ad5b17/ucx-cuda
Akshay-Venkatesh/ucx-py@devel

Anything I am getting wrong here?

@mrocklin
Contributor Author

mrocklin commented May 9, 2019

That environment looks fine to me. Let me push a couple of changes to make it easy to get a dashboard up, and then let's jump on a screenshare to see what's happening.

@mrocklin
Contributor Author

mrocklin commented May 9, 2019

In offline conversation @Akshay-Venkatesh mentioned

Is the cuda_visible_devices env param consistent across dask workers for this test? That would be one explanation of why you would see the invalid device context error

Then I said

In [1]: from dask_cuda.local_cuda_cluster import cuda_visible_devices

In [2]: for i in range(8):
   ...:     print(i, cuda_visible_devices(i, range(8)))
   ...:
0 0,1,2,3,4,5,6,7
1 1,2,3,4,5,6,7,0
2 2,3,4,5,6,7,0,1
3 3,4,5,6,7,0,1,2
4 4,5,6,7,0,1,2,3
5 5,6,7,0,1,2,3,4
6 6,7,0,1,2,3,4,5
7 7,0,1,2,3,4,5,6

Then @Akshay-Venkatesh

is it not possible to select appropriate devices from within the python program having given all workers the same env param?

Then me again

How can we select the appropriate device from within Python? And actually, can I ask that we move this to GitHub?

And here we are
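
The rotation printed above can be reproduced with a standalone sketch. The function name mirrors the helper in dask_cuda.local_cuda_cluster, but this implementation is an assumption inferred only from the printed output:

```python
def cuda_visible_devices(i, devices):
    """Rotate the device list so that device ``i`` comes first.

    Standalone sketch inferred from the printed output above; the real
    helper lives in dask_cuda.local_cuda_cluster.
    """
    devices = list(devices)
    n = len(devices)
    return ",".join(str(devices[(i + j) % n]) for j in range(n))
```

Each worker gets the same set of GPUs but a different front-of-list device, so libraries that default to the first visible device spread across all eight GPUs.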

@mrocklin
Contributor Author

mrocklin commented May 9, 2019

is it not possible to select appropriate devices from within the python program having given all workers the same env param?

Ah, so if you mean, can we ask that CuPy use a particular GPU, then yes, cupy has API for that. cuDF may have API for that (actually I don't think that it does) but it would be different. Same with TensorFlow and PyTorch.

Unfortunately, the only consistent way to have user code prefer a particular GPU today that is cross-library is to use CUDA_VISIBLE_DEVICES.
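
As a concrete illustration of that cross-library approach, the variable must be set before any CUDA library initializes in the process; the device index used here is only an example:

```python
import os

# CUDA_VISIBLE_DEVICES must be set before importing cupy, cudf, torch,
# etc., because CUDA enumerates devices at initialization time.
# "3" is just an example index.
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

# From this point on, every CUDA library in this process sees a single
# GPU and treats it as its device 0.
```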

@mrocklin
Contributor Author

mrocklin commented May 9, 2019

I can restrict the visible devices to just one GPU though, maybe that will help?

@Akshay-Venkatesh

How can we select the appropriate device from within Python? And actually, can I ask that we move this to GitHub?

@mrocklin I'm pasting what I added here:

akvenkatesh@prm-dgx-33:~/ucx-py$ git diff benchmarks/recv-into-client.py
diff --git a/benchmarks/recv-into-client.py b/benchmarks/recv-into-client.py
index bba9eb8..4ef1f15 100644
--- a/benchmarks/recv-into-client.py
+++ b/benchmarks/recv-into-client.py
@@ -134,9 +134,15 @@ async def main(args=None):
         import cupy as xp
 
     if args.server:
+        if args.object_type == 'cupy':
+            xp.cuda.runtime.setDevice(0)
+            print(xp.cuda.runtime.getDevice())
         await connect(args.server, args.port, args.n_bytes, args.n_iter,
                       args.recv, xp, args.verbose, args.inc)
     else:
+        if args.object_type == 'cupy':
+            xp.cuda.runtime.setDevice(1)
+            print(xp.cuda.runtime.getDevice())
         await serve(args.port, args.n_bytes, args.n_iter,
                     args.recv, xp, args.verbose, args.inc)

@Akshay-Venkatesh

I can restrict the visible devices to just one GPU though, maybe that will help?

That may work for this test but I assume you'd like to be able to use all GPUs for most workloads. Am I misinterpreting this?

@mrocklin
Contributor Author

mrocklin commented May 9, 2019

I tried the following spec, which allows only one visible device per process, and didn't see any improvement.

    spec = {
        i: {
            "cls": Nanny,
            "options": {
                "env": {
                    "CUDA_VISIBLE_DEVICES": str(i),  # <<<----- this is the major change
                    # 'UCX_NET_DEVICES': 'mlx5_%d:1' % (i // 2)
                },
                "interface": "ib%d" % (i // 2),
                "protocol": "ucx",
                "ncores": 1,
            },
        }
        for i in range(8)
    }

@mrocklin
Contributor Author

mrocklin commented May 9, 2019

Right, so the challenge with the xp.cuda.runtime.setDevice approach is that we probably can't do that for all possible libraries that the user might want to use. Unfortunately there isn't any standard API for CUDA itself in Python.
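
The point can be made concrete with a sketch of what a cross-library helper would have to look like. Only the cupy call mirrors the runtime call from the diff above; the torch branch and the overall shape are illustrative assumptions:

```python
def set_device(library_module, device_index):
    """Select a GPU for one library; there is no cross-library API.

    Sketch only: each framework needs its own branch, which is exactly
    why this approach does not scale to arbitrary user libraries.
    """
    name = library_module.__name__
    if name == "cupy":
        library_module.cuda.runtime.setDevice(device_index)
    elif name == "torch":
        library_module.cuda.set_device(device_index)
    else:
        raise NotImplementedError(f"no generic CUDA device API for {name}")
```

Any library not special-cased here falls through to the error, whereas CUDA_VISIBLE_DEVICES works uniformly because it acts at the driver level.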

@mrocklin
Contributor Author

mrocklin commented May 9, 2019

That may work for this test but I assume you'd like to be able to use all GPUs for most workloads. Am I misinterpreting this?

Yes, but I'm comfortable with each dask worker having access to only one of them. I have one dask worker per GPU.

Regardless, it didn't seem to solve my immediate problem.

@Akshay-Venkatesh

are you seeing these errors or something else?

cuda_ipc_md.c:62   UCX  ERROR cuCtxGetDevice(&cu_device) is failed. ret:invalid device context
[1557433464.508925] [dgx15:55612:0]       ucp_rkey.c:250  UCX  ERROR Failed to unpack remote key from remote md[5]: Input/output error
[dgx15:55612:0:55931] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x20)

If yes, do you mind trying UCX_TLS=rc,sm,tcp,cuda_copy? Since the error is showing up in the cuda_ipc transport, I'm trying to avoid using that.
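
The suggested setting would be applied in the environment before launching the workers, along these lines (the launch itself is omitted):

```shell
# Transport list suggested above: rc, shared memory, tcp, plus
# cuda_copy, deliberately omitting the cuda_ipc transport that
# produced the error.
export UCX_TLS=rc,sm,tcp,cuda_copy
echo "$UCX_TLS"
```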

@rjzamora
Member

rjzamora commented May 9, 2019

Setting UCX_TLS=rc,sm,tcp,cuda_copy helped for me. I also needed to set a non-default port for the scheduler on dgx15 (e.g. "port": 8788).

@mrocklin
Contributor Author

mrocklin commented May 9, 2019

Right, so the challenge with the xp.cuda.runtime.setDevice approach is that we probably can't do that for all possible libraries that the user might want to use. Unfortunately there isn't any standard API for CUDA itself in Python.

@Akshay-Venkatesh would it make sense for UCX to respect the preference implied by CUDA_VISIBLE_DEVICES where the first value is used by default? This seems to be the strongest convention on this topic that I've seen so far (though I'm fairly new).

@Akshay-Venkatesh

Akshay-Venkatesh commented May 9, 2019 via email

@mrocklin
Contributor Author

mrocklin commented May 9, 2019

Thanks @Akshay-Venkatesh . It sounds like many of the issues are consistent with that problem. Hopefully that helps to resolve things. If there is a development branch that I can build from (that also includes your other changes) please let me know. I'd be keen to try it out.

@Akshay-Venkatesh

I haven't completed the PR yet but I'll definitely post here when the PR is complete.

@mrocklin mrocklin force-pushed the dgx-1 branch 2 times, most recently from 1baac6e to a83d89c on May 28, 2019
@mrocklin
Contributor Author

This repository will start depending on the dask/distributed master branch.

Is it ok to place this in the ci/cpu/build.sh and ci/gpu/build.sh files?

cc @rlratzel , who I think has been thinking about cross-project testing.

@mrocklin
Contributor Author

This repository will start depending on the dask/distributed master branch.

Is it ok to place this in the ci/cpu/build.sh and ci/gpu/build.sh files?

cc @rlratzel , who I think has been thinking about cross-project testing.

Or maybe @raydouglass knows the answer to this?

@rlratzel
Contributor

rlratzel commented May 30, 2019

This repository will start depending on the dask/distributed master branch.

Is it ok to place this in the ci/cpu/build.sh and ci/gpu/build.sh files?

Is the dependency needed just for running tests, or does the library itself require it?

If it's only needed for tests, then I believe it's a fairly common pattern to add a pip install line to the gpu/cpu scripts. As a nicety to future maintainers, I've found it helpful to add checks for special dependencies any tests need in the corresponding setup, with a clear error message if they're missing.

If the library itself needs it, then we'd obviously want to ensure it's installed with the library's conda package (meta.yml/requirements.txt). I'm not familiar with dask/distributed, but is that an option if the library depends on it?
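
The "clear error message" pattern described above can be sketched as a small helper in the test setup. The helper name and message format are illustrative, not an existing rapidsai convention:

```python
import importlib


def require(module_name, install_hint):
    """Import a test dependency or fail with an actionable message.

    Illustrative sketch of the dependency-check pattern described
    above; the name and signature are assumptions.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"tests require {module_name!r}; install it with: {install_hint}"
        ) from exc
```

A test module would then call, for example, `require("distributed", "pip install distributed")` at import time, so a missing dev dependency fails loudly instead of producing an obscure collection error.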

@mrocklin
Contributor Author

The library itself now depends on it. We'll have to release distributed before the next release of dask-cuda.

@mrocklin
Contributor Author

@rlratzel given the information above, what is the best approach here?

@mrocklin
Contributor Author

mrocklin commented Jun 3, 2019

If the library itself needs it, then we'd obviously want to ensure it's installed with the library's conda package (meta.yml/requirements.txt). I'm not familiar with dask/distributed, but is that an option if the library depends on it?

So, yes, it will depend on the most recent version, which is not currently released. It seems odd to put a dev version in a requirements.txt/meta.yml file. I can though if you prefer.

If I don't get a follow up here soonish I'm going to just add a pip install line in the CI script.

@raydouglass
Contributor

raydouglass commented Jun 3, 2019

So, yes, it will depend on the most recent version, which is not currently released. It seems odd to put a dev version in a requirements.txt/meta.yml file. I can though if you prefer.

I think this would be necessary though right? Otherwise the conda packages won't install the right dependencies.

If I don't get a follow up here soonish I'm going to just add a pip install line in the CI script.

When is distributed being released? Can this PR wait for that?

@mrocklin
Contributor Author

mrocklin commented Jun 3, 2019

I think this would be necessary though right? Otherwise the conda packages won't install the right dependencies.

How do we handle other packages on which we depend on master, like cudf?

When is distributed being released? Can this PR wait for that?

A few weeks probably (there is a lot of churn in the upcoming release), but before RAPIDS 0.8.

Currently this repository won't work with the master branch of distributed. This PR fixes those issues.

@mrocklin
Contributor Author

mrocklin commented Jun 3, 2019

Any further suggestions @raydouglass ? I think it's ok to merge something so that master works with master, but I'm open to other suggestions.

@raydouglass
Contributor

Any further suggestions @raydouglass ? I think it's ok to merge something so that master works with master, but I'm open to other suggestions.

The only suggestion I have is to wait which isn't ideal. So maybe make the pip install changes. Please open an issue to remove those changes after distributed is released. That way we'll have a record and won't forget to do it before v0.8 release.

@mrocklin
Contributor Author

mrocklin commented Jun 5, 2019

rerun tests

@mrocklin
Contributor Author

mrocklin commented Jun 5, 2019

If it's only needed for tests, then I believe it's a fairly common pattern to add a pip install line to the gpu/cpu scripts.

This didn't suffice. The conda build tests fail. How would you all like this handled?

@raydouglass
Contributor

The pip install has to be in the conda/recipes/dask-cuda/build.sh script. During a conda build, a new environment is created, so environment changes in CI scripts aren't transferred over.

@mrocklin mrocklin merged commit 7606157 into rapidsai:branch-0.7 Jun 6, 2019
@mrocklin
Contributor Author

mrocklin commented Jun 6, 2019

Ah crap. It looks like this was targeting branch 0.7 when I merged. My apologies. What do I need to do to correct this on the ops end? I can easily move the commit over to 0.8, but I'm concerned that I might have messed with one of your internal systems.

mrocklin added a commit to mrocklin/dask-cuda that referenced this pull request Jun 6, 2019
@mrocklin
Contributor Author

mrocklin commented Jun 6, 2019

If it was up to me I would probably just remove the commit and force-push, or add a revert commit.

mrocklin added a commit that referenced this pull request Jun 6, 2019
@kkraus14
Contributor

@raydouglass @mike-wendt need you to jump in here to resolve this.
