Add UCXX proxy backend#116
Conversation
rapids_dask_dependency/patches/distributed/comm/__rdd_patch_ucx.py
Outdated
Show resolved
Hide resolved
| backends["ucx"] = _rewrite_ucxx_backend()() | ||
| backends["ucx-old"] = UCXBackendOld() |
There was a problem hiding this comment.
Question (for my education): Prior to the in-progress ucx -> ucxx migration, was it possible to use "ucxx" without dask-cuda?
There was a problem hiding this comment.
Correct: https://github.com/dask/distributed/blob/main/distributed/comm/ucx.py
We'll need to deprecate / remove that once the dust settles here.
There was a problem hiding this comment.
Yes, it's always been possible by specifying protocol="ucxx" if you have distributed-ucxx installed. However, I don't think I've ever heard of anybody ever doing that. Therefore, I decided to take a stab at just making this change for RAPIDS users, the non-RAPIDS community will still need to switch (if anybody is actually using UCX-Py) but they will not get a warning, at some point we'll just remove the UCX backend from Distributed and force users to install distributed-ucxx.
TomAugspurger
left a comment
There was a problem hiding this comment.
Can / should we add distributed-ucxx as a dependency required dependency here? That removes the "I have ucx-py but not ucxx" scenario.
| backends["ucx"] = _rewrite_ucxx_backend()() | ||
| backends["ucx-old"] = UCXBackendOld() |
There was a problem hiding this comment.
Correct: https://github.com/dask/distributed/blob/main/distributed/comm/ucx.py
We'll need to deprecate / remove that once the dust settles here.
rapids_dask_dependency/patches/distributed/comm/__rdd_patch_ucx.py
Outdated
Show resolved
Hide resolved
I think not because UCX (UCX-Py inclusive) was/is not a required dependency for Dask(-CUDA) nor RAPIDS, so if we add |
…1516) UCXX is becoming the default UCX communicator via a proxy backend with rapidsai/rapids-dask-dependency#116. This proxy backend now points `protocol="ucx"` to UCXX provided distributed-ucxx is installed, otherwise fallback to UCX-Py and emit a warning, and while UCX-Py is not archived provide a `protocol="ucx-old"` as fallback. This change ensures the proper (UCXX or UCX-Py) is initialized based on the specified protocol and whether distributed-ucxx is installed or not. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #1516
Switch the default distributed-ucxx progress mode to `"blocking"`, which is equivalent to UCX-Py's default progress mode, allowing a more transparent move of Dask to UCXX. This change may later be reverted once we have more confidence in a successful transition to UCXX. Additionally allow both `"ucx://"` and `"ucxx://"` address prefixes, since with the change from rapidsai/rapids-dask-dependency#116 both will be valid to enable UCXX when distributed-ucxx is installed. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) - Tom Augspurger (https://github.com/TomAugspurger) URL: #453
|
I've added tests to this, and the tests depend on Dask-CUDA and NVML. Only then I noticed errors: This seems to be because the runners are CPU only. However, both conda and wheel build jobs are supposed to run on CUDA nodes, and also the containers pulled for both conda and wheel containers are CUDA as well. Any ideas why the runners being selected are CPU-only @jameslamb @vyasr ? |
|
Builds do not run on GPU nodes, only tests do. The links you're pointing to are the versions of CUDA installed in the container because you need the CTK to build CUDA-enabled libraries, but we always build on nodes without CUDA installed on the host. |
Argh. Is it possible to enable builds on CUDA nodes here? Otherwise I think it may be too much trouble to get tests working here, and I would probably think about moving the tests to a different repo (probably Dask-CUDA), but this PR would need to go in first anyway. |
|
It is possible, but requires a bit more work that I don't know if you want to do. All our workflows in RAPIDS reuse the same shared workflows. You can see that in this repo's conda build job here. The types of nodes that each job runs on is therefore also governed by the shared workflow. The options that you have would be (in order of increasing complexity):
Practically speaking my guess is that you'd only care enough to do the work for option 1. The main question is whether the build team will support having that option. The other options seem like a lot more work. |
|
Thanks for the detailed response, Vyas! Indeed, I don't care enough to spend several more hours on this, and also potentially have to convince the build team to support that. In that case for me I'd prefer to be more agile here and merge this as is and add the test into Dask-CUDA after this is merged. I tested all cases multiple times, including in the build containers locally with the complete scripts I previously wrote for this PR and everything passed, this gives me enough confidence to think that solution should suffice. WDYT? |
vyasr
left a comment
There was a problem hiding this comment.
I agree that it's not worth the effort. You've done enough testing and I'd rather we move forward. The changes look fine to me, so approving.
rjzamora
left a comment
There was a problem hiding this comment.
One question, but seems good to me. Thanks for moving forward with this Peter!
|
|
||
| backends["ucx"] = UCXBackend() | ||
| class UCXConnectorOld(UCXConnector): | ||
| prefix = "ucx-old://" |
There was a problem hiding this comment.
Why are we using a different prefix for ucx-py. Is this for debugging purposes (so we can easily detect from the urls that ucx-py is being used)?
There was a problem hiding this comment.
The original prefix ucx intentionally raises the warning if distributed-ucxx is not installed but allows continuing to use UCX-Py, the ucx-old prefix is only there to allow users silencing the warning, see https://github.com/rapidsai/rapids-dask-dependency/pull/116/files#diff-75fb79bf88eb7a0e127ec89ae51605efc72a6a829420b95c7463adaef1736878R574-R576.
|
Thanks all for the reviews! I'll go ahead and merge this now and watch the results from nightlies tomorrow morning. |
|
/merge |
With changes from rapidsai/rapids-dask-dependency#116, `protocol="ucx"` now uses UCXX when `distributed-ucxx` is installed as is the case in CI. Some tests broke because of the conditionals to validate UCX-Py vs UCXX, this change fixes those that are currently failing in CI, the remaining tests will be addressed in a follow-up PR. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #1518
rapidsai/rapids-dask-dependency#116 changes the meaning of "ucx". To restore the old behavior, we use "ucx-old". ucx-py, which the protocol `ucx-old` uses, is being deprecated in favor of `ucxx` (rapidsai/ucx-py#1151) so we could also consider removing this job / tests.
rapidsai/rapids-dask-dependency#116 changes the meaning of `protocol="ucx"`. To restore the old behavior, we use `"ucx-old"`. UCX-Py, which the protocol `"ucx-old"` uses, is being deprecated in favor of UCXX (rapidsai/ucx-py#1151) and is scheduled for removal in 25.10, allowing us to then remove UCX-Py tests entirely. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Corey J. Nolet (https://github.com/cjnolet) - Bradley Dice (https://github.com/bdice) URL: #2743
rapidsai/rapids-dask-dependency#116 changes the meaning of "ucx". To restore the old behavior, we use "ucx-old". ucx-py, which the protocol `ucx-old` uses, is being deprecated in favor of `ucxx` (rapidsai/ucx-py#1151) so we could also consider removing this job / tests. Authors: - Tom Augspurger (https://github.com/TomAugspurger) Approvers: - Simon Adorf (https://github.com/csadorf) - Bradley Dice (https://github.com/bdice) URL: #7008
Rewrite the UCXX/UCX-Py proxy backends added in #116 to deprecate UCX-Py. We have already removed the UCX-Py backend from Distributed in dask/distributed#9105, but while there's no release we cannot rely on that, so the changes in the PR work in the same way we expect when the migration is completed and Distributed has a new release. If distributed-ucxx is not installed a warning and exception are raised telling the user to install it. The state after this PR is as follows: 1. If distributed-ucxx is installed: 1. `protocol="ucx"` / `protocol="ucxx"` should continue working as before, both were already using distributed-ucxx 2. `protocol="ucx-old"` (only exists in rapids-dask-dependency added in #116) will raise the same `FutureWarning` as `protocol="ucx"` without RAPIDS 2. If distributed-ucxx is NOT installed: 1. `protocol="ucx"` / `protocol="ucxx"` / `protocol="ucx-old"` all raise a `FutureWarning` and don’t let users create processes. The final state is expected to be achieved when a new Distributed release is cut and the `__rdd_patch_ucx.py` file is removed from rapids-dask-dependency, which will be: 1. If distributed-ucxx is installed: 1. `protocol="ucx"` (now transparently uses distributed-ucxx) / `protocol="ucxx"` (always used distributed-ucxx) should continue working as before 2. If distributed-ucxx is NOT installed: 1. `protocol="ucx"` raises a `FutureWarning` and doesn’t let users create processes 2. `protocol="ucxx"` never existed before distributed-ucxx so it will just fail with a wrong protocol name Specifically I've tested the following to confirm what's behavior matches what is described above as the state of this PR (using all `ucx`/`ucxx`/`ucx-old` combinations): 1. `dask scheduler --protocol=ucx` 2. `dask worker ucx://127.0.0.1:8786` 3. `LocalCluster(protocol="ucx", silence_logs=False)` 4. `LocalCUDACluster(protocol="ucx", silence_logs=False)` Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Tom Augspurger (https://github.com/TomAugspurger) URL: #120
Add a proxy communication backend to enable UCXX when
protocol="ucx", overriding UCX-Py. Additionally, provide warnings when UCXX is not installed letting the user know UCX-Py is EOL and will be removed, as well as a proper fallbackprotocol="ucx-old"to continue using UCX-Py while it's not completely archived.Depends on rapidsai/dask-cuda#1516 and rapidsai/ucxx#453.