Add a hard-coded DGX configuration#46

Merged
mrocklin merged 18 commits into rapidsai:branch-0.7 from mrocklin:dgx-1
Jun 6, 2019

Conversation

@mrocklin
Contributor

@mrocklin mrocklin commented May 9, 2019

This currently depends on github.com/mrocklin/distributed@dev

@mrocklin
Contributor Author

mrocklin commented May 9, 2019

Currently benchmarking with

from dask_cuda import DGX
from dask.distributed import Client, wait
cluster = DGX()
client = Client(cluster)

import cupy, dask.array as da, dask
rs = da.random.RandomState(RandomState=cupy.random.RandomState)
x = rs.normal(10, 1, size=(50000, 50000)).persist()
y = (x + x.T).sum().compute()

Script

from dask_cuda import DGX
from dask.distributed import Client, wait
import cupy, dask.array as da, dask

with DGX() as cluster:
    with Client(cluster) as client:
        rs = da.random.RandomState(RandomState=cupy.random.RandomState)
        x = rs.normal(10, 1, size=(50000, 50000)).persist()
        wait(x)
        y = (x + x.T).sum().compute()

@mrocklin
Contributor Author

mrocklin commented May 9, 2019

@Akshay-Venkatesh @rjzamora if you all have a chance, here is a system that, I think, should set up a DGX properly. I'm still not able to use the UCX_NET_DEVICES environment variable effectively.

dask_cuda/dgx.py Outdated
scheduler = {
"cls": Scheduler,
"options": {
"interface": "ib0",
Member

It might make sense to use a different interface for the scheduler in the case that none of your workers are running on gpu-0 or gpu-1

@rjzamora
Member

rjzamora commented May 9, 2019

@mrocklin - I am not able to run the benchmark you shared. My env:

mrocklin/distributed@dev
mrocklin/dask-cuda@dgx-1
Akshay-Venkatesh/ucx@1ad5b17/ucx-cuda
Akshay-Venkatesh/ucx-py@devel

Anything I am getting wrong here?

@mrocklin
Contributor Author

mrocklin commented May 9, 2019

That environment looks fine to me. Let me push a couple of changes to make it easy to get a dashboard up, and then let's jump on a screenshare to see what's happening.

@mrocklin
Contributor Author

mrocklin commented May 9, 2019

In offline conversation @Akshay-Venkatesh mentioned

Is the cuda_visible_devices env param consistent across dask workers for this test? That would be one explanation of why you would see the invalid device context error

Then I said

In [1]: from dask_cuda.local_cuda_cluster import cuda_visible_devices

In [2]: for i in range(8):
   ...:     print(i, cuda_visible_devices(i, range(8)))
   ...:
0 0,1,2,3,4,5,6,7
1 1,2,3,4,5,6,7,0
2 2,3,4,5,6,7,0,1
3 3,4,5,6,7,0,1,2
4 4,5,6,7,0,1,2,3
5 5,6,7,0,1,2,3,4
6 6,7,0,1,2,3,4,5
7 7,0,1,2,3,4,5,6

Then @Akshay-Venkatesh

is it not possible to select appropriate devices from within the python program having given all workers the same env param?

Then me again

How can we select the appropriate device from within Python? And actually, can I ask that we move this to GitHub?

And here we are
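
The rotation printed above can be reproduced with a standalone sketch. The function name mirrors the helper in dask_cuda.local_cuda_cluster, but this implementation is an assumption inferred only from the printed output:

```python
def cuda_visible_devices(i, devices):
    """Rotate the device list so that device ``i`` comes first.

    Standalone sketch inferred from the printed output above; the real
    helper lives in dask_cuda.local_cuda_cluster.
    """
    devices = list(devices)
    n = len(devices)
    return ",".join(str(devices[(i + j) % n]) for j in range(n))
```

Each worker gets the same set of GPUs but a different front-of-list device, so libraries that default to the first visible device spread across all eight GPUs.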

@mrocklin
Contributor Author

mrocklin commented May 9, 2019

is it not possible to select appropriate devices from within the python program having given all workers the same env param?

Ah, so if you mean, can we ask that CuPy use a particular GPU, then yes, cupy has API for that. cuDF may have API for that (actually I don't think that it does) but it would be different. Same with TensorFlow and PyTorch.

Unfortunately, the only consistent way to have user code prefer a particular GPU today that is cross-library is to use CUDA_VISIBLE_DEVICES.
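
As a concrete illustration of that cross-library approach, the variable must be set before any CUDA library initializes in the process; the device index used here is only an example:

```python
import os

# CUDA_VISIBLE_DEVICES must be set before importing cupy, cudf, torch,
# etc., because CUDA enumerates devices at initialization time.
# "3" is just an example index.
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

# From this point on, every CUDA library in this process sees a single
# GPU and treats it as its device 0.
```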

@mrocklin
Contributor Author

mrocklin commented May 9, 2019

I can restrict the visible devices to just one GPU though, maybe that will help?

@Akshay-Venkatesh

How can we select the appropriate device from within Python? And actually, can I ask that we move this to GitHub?

@mrocklin I'm pasting what I added here:

akvenkatesh@prm-dgx-33:~/ucx-py$ git diff benchmarks/recv-into-client.py
diff --git a/benchmarks/recv-into-client.py b/benchmarks/recv-into-client.py
index bba9eb8..4ef1f15 100644
--- a/benchmarks/recv-into-client.py
+++ b/benchmarks/recv-into-client.py
@@ -134,9 +134,15 @@ async def main(args=None):
         import cupy as xp
 
     if args.server:
+        if args.object_type == 'cupy':
+            xp.cuda.runtime.setDevice(0)
+            print(xp.cuda.runtime.getDevice())
         await connect(args.server, args.port, args.n_bytes, args.n_iter,
                       args.recv, xp, args.verbose, args.inc)
     else:
+        if args.object_type == 'cupy':
+            xp.cuda.runtime.setDevice(1)
+            print(xp.cuda.runtime.getDevice())
         await serve(args.port, args.n_bytes, args.n_iter,
                     args.recv, xp, args.verbose, args.inc)

@Akshay-Venkatesh

I can restrict the visible devices to just one GPU though, maybe that will help?

That may work for this test but I assume you'd like to be able to use all GPUs for most workloads. Am I misinterpreting this?

@mrocklin
Contributor Author

mrocklin commented May 9, 2019

I tried the following spec, which allows only one visible device per process, and didn't see any improvement.

    spec = {
        i: {
            "cls": Nanny,
            "options": {
                "env": {
                    "CUDA_VISIBLE_DEVICES": str(i),  # <<<----- this is the major change
                    # 'UCX_NET_DEVICES': 'mlx5_%d:1' % (i // 2)
                },
                "interface": "ib%d" % (i // 2),
                "protocol": "ucx",
                "ncores": 1,
            },
        }
        for i in range(8)
    }

@mrocklin
Contributor Author

mrocklin commented May 9, 2019

Right, so the challenge with the xp.cuda.runtime.setDevice approach is that we probably can't do that for all possible libraries that the user might want to use. Unfortunately there isn't any standard API for CUDA itself in Python.
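
The point can be made concrete with a sketch of what a cross-library helper would have to look like. Only the cupy call mirrors the runtime call from the diff above; the torch branch and the overall shape are illustrative assumptions:

```python
def set_device(library_module, device_index):
    """Select a GPU for one library; there is no cross-library API.

    Sketch only: each framework needs its own branch, which is exactly
    why this approach does not scale to arbitrary user libraries.
    """
    name = library_module.__name__
    if name == "cupy":
        library_module.cuda.runtime.setDevice(device_index)
    elif name == "torch":
        library_module.cuda.set_device(device_index)
    else:
        raise NotImplementedError(f"no generic CUDA device API for {name}")
```

Any library not special-cased here falls through to the error, whereas CUDA_VISIBLE_DEVICES works uniformly because it acts at the driver level.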

@mrocklin
Contributor Author

mrocklin commented May 9, 2019

That may work for this test but I assume you'd like to be able to use all GPUs for most workloads. Am I misinterpreting this?

Yes, but I'm comfortable with each dask worker having access to only one of them. I have one dask worker per GPU.

Regardless, it didn't seem to solve my immediate problem.

@Akshay-Venkatesh

are you seeing these errors or something else?

cuda_ipc_md.c:62   UCX  ERROR cuCtxGetDevice(&cu_device) is failed. ret:invalid device context
[1557433464.508925] [dgx15:55612:0]       ucp_rkey.c:250  UCX  ERROR Failed to unpack remote key from remote md[5]: Input/output error
[dgx15:55612:0:55931] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x20)

If yes, do you mind trying UCX_TLS=rc,sm,tcp,cuda_copy? Since the error is showing up in the cuda_ipc transport, I'm trying to avoid using that.
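
The suggested setting would be applied in the environment before launching the workers, along these lines (the launch itself is omitted):

```shell
# Transport list suggested above: rc, shared memory, tcp, plus
# cuda_copy, deliberately omitting the cuda_ipc transport that
# produced the error.
export UCX_TLS=rc,sm,tcp,cuda_copy
echo "$UCX_TLS"
```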

@rjzamora
Member

rjzamora commented May 9, 2019

Setting UCX_TLS=rc,sm,tcp,cuda_copy helped for me. I also needed to set a non-default port for the scheduler on dgx15 (e.g. "port": 8788).

@mrocklin
Contributor Author

mrocklin commented May 9, 2019

Right, so the challenge with the xp.cuda.runtime.setDevice approach is that we probably can't do that for all possible libraries that the user might want to use. Unfortunately there isn't any standard API for CUDA itself in Python.

@Akshay-Venkatesh would it make sense for UCX to respect the preference implied by CUDA_VISIBLE_DEVICES where the first value is used by default? This seems to be the strongest convention on this topic that I've seen so far (though I'm fairly new).

@Akshay-Venkatesh

Akshay-Venkatesh commented May 9, 2019 via email

@mrocklin
Contributor Author

mrocklin commented May 9, 2019

Thanks @Akshay-Venkatesh . It sounds like many of the issues are consistent with that problem. Hopefully that helps to resolve things. If there is a development branch that I can build from (that also includes your other changes) please let me know. I'd be keen to try it out.

@Akshay-Venkatesh

I haven't completed the PR yet but I'll definitely post here when the PR is complete.

@mrocklin mrocklin force-pushed the dgx-1 branch 2 times, most recently from 1baac6e to a83d89c on May 28, 2019
@mrocklin
Contributor Author

This repository will start depending on the dask/distributed master branch.

Is it ok to place this in the ci/cpu/build.sh and ci/gpu/build.sh files?

cc @rlratzel , who I think has been thinking about cross-project testing.

@mrocklin
Contributor Author

This repository will start depending on the dask/distributed master branch.

Is it ok to place this in the ci/cpu/build.sh and ci/gpu/build.sh files?

cc @rlratzel , who I think has been thinking about cross-project testing.

Or maybe @raydouglass knows the answer to this?

@rlratzel
Contributor

rlratzel commented May 30, 2019

This repository will start depending on the dask/distributed master branch.

Is it ok to place this in the ci/cpu/build.sh and ci/gpu/build.sh files?

Is the dependency needed just for running tests, or does the library itself require it?

If it's only needed for tests, then I believe it's a fairly common pattern to add a pip install line to the gpu/cpu scripts. As a nicety to future maintainers, I've found it helpful to add checks for special dependencies any tests need in the corresponding setup, with a clear error message if they're missing.

If the library itself needs it, then we'd obviously want to ensure it's installed with the library's conda package (meta.yml/requirements.txt). I'm not familiar with dask/distributed, but is that an option if the library depends on it?
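
The "clear error message" pattern described above can be sketched as a small helper in the test setup. The helper name and message format are illustrative, not an existing rapidsai convention:

```python
import importlib


def require(module_name, install_hint):
    """Import a test dependency or fail with an actionable message.

    Illustrative sketch of the dependency-check pattern described
    above; the name and signature are assumptions.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"tests require {module_name!r}; install it with: {install_hint}"
        ) from exc
```

A test module would then call, for example, `require("distributed", "pip install distributed")` at import time, so a missing dev dependency fails loudly instead of producing an obscure collection error.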

@mrocklin
Contributor Author

The library itself now depends on it. We'll have to release distributed before the next release of dask-cuda.

@mrocklin
Contributor Author

@rlratzel given the information above, what is the best approach here?

@mrocklin
Contributor Author

mrocklin commented Jun 3, 2019

If the library itself needs it, then we'd obviously want to ensure it's installed with the library's conda package (meta.yml/requirements.txt). I'm not familiar with dask/distributed, but is that an option if the library depends on it?

So, yes, it will depend on the most recent version, which is not currently released. It seems odd to put a dev version in a requirements.txt/meta.yml file. I can though if you prefer.

If I don't get a follow up here soonish I'm going to just add a pip install line in the CI script.

@raydouglass
Contributor

raydouglass commented Jun 3, 2019

So, yes, it will depend on the most recent version, which is not currently released. It seems odd to put a dev version in a requirements.txt/meta.yml file. I can though if you prefer.

I think this would be necessary though right? Otherwise the conda packages won't install the right dependencies.

If I don't get a follow up here soonish I'm going to just add a pip install line in the CI script.

When is distributed being released? Can this PR wait for that?

@mrocklin
Contributor Author

mrocklin commented Jun 3, 2019

I think this would be necessary though right? Otherwise the conda packages won't install the right dependencies.

How do we handle other packages on which we depend on master, like cudf?

When is distributed being released? Can this PR wait for that?

A few weeks probably (there is a lot of churn in the upcoming release), but before RAPIDS 0.8.

Currently this repository won't work with the master branch of distributed. This PR fixes those issues.

@mrocklin
Contributor Author

mrocklin commented Jun 3, 2019

Any further suggestions @raydouglass ? I think it's ok to merge something so that master works with master, but I'm open to other suggestions.

@raydouglass
Contributor

Any further suggestions @raydouglass ? I think it's ok to merge something so that master works with master, but I'm open to other suggestions.

The only suggestion I have is to wait which isn't ideal. So maybe make the pip install changes. Please open an issue to remove those changes after distributed is released. That way we'll have a record and won't forget to do it before v0.8 release.

@mrocklin
Contributor Author

mrocklin commented Jun 5, 2019

rerun tests

@mrocklin
Contributor Author

mrocklin commented Jun 5, 2019

If it's only needed for tests, then I believe it's a fairly common pattern to add a pip install line to the gpu/cpu scripts.

This didn't suffice. The conda build tests fail. How would you all like this handled?

@raydouglass
Contributor

The pip install has to be in the conda/recipes/dask-cuda/build.sh script. During a conda build, a new environment is created, so environment changes in CI scripts aren't transferred over.

@mrocklin mrocklin merged commit 7606157 into rapidsai:branch-0.7 Jun 6, 2019
@mrocklin
Contributor Author

mrocklin commented Jun 6, 2019

Ah crap. It looks like this was targeting branch 0.7 when I merged. My apologies. What do I need to do to correct this on the ops end? I can easily move the commit over to 0.8, but I'm concerned that I might have messed with one of your internal systems.

mrocklin added a commit to mrocklin/dask-cuda that referenced this pull request Jun 6, 2019
@mrocklin
Contributor Author

mrocklin commented Jun 6, 2019

If it was up to me I would probably just remove the commit and force-push, or add a revert commit.

mrocklin added a commit that referenced this pull request Jun 6, 2019
@kkraus14
Contributor

@raydouglass @mike-wendt need you to jump in here to resolve this.
