
Add UCXX proxy backend#116

Merged
rapids-bot[bot] merged 20 commits into rapidsai:branch-25.08 from pentschev:ucxx-proxy
Jul 3, 2025

Conversation

@pentschev (Member) commented Jul 2, 2025

Add a proxy communication backend that enables UCXX when `protocol="ucx"`, overriding UCX-Py. Additionally, when UCXX is not installed, emit a warning letting the user know that UCX-Py is EOL and will be removed, and provide a proper fallback, `protocol="ucx-old"`, to continue using UCX-Py until it is fully archived.

Depends on rapidsai/dask-cuda#1516 and rapidsai/ucxx#453.
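The selection logic described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual patch: the function name, the warning text, and the returned stand-in strings are all hypothetical.

```python
import importlib.util
import warnings


def select_ucx_backend():
    """Pick the comm backend used for protocol="ucx".

    Prefers UCXX (distributed-ucxx) when it is installed; otherwise
    falls back to the legacy UCX-Py backend with a deprecation warning.
    """
    if importlib.util.find_spec("distributed_ucxx") is not None:
        return "ucxx"  # stand-in for the distributed-ucxx backend class
    warnings.warn(
        "UCX-Py is EOL and will be removed; install distributed-ucxx, "
        'or use protocol="ucx-old" to keep UCX-Py and silence this warning.',
        FutureWarning,
    )
    return "ucx-py"  # stand-in for the legacy UCX-Py backend class
```

In the real patch the chosen backend class is registered under `backends["ucx"]`, so existing `protocol="ucx"` users are redirected without changing their code.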

Comment on lines +649 to +650
backends["ucx"] = _rewrite_ucxx_backend()()
backends["ucx-old"] = UCXBackendOld()
Member

Question (for my education): Prior to the in-progress ucx -> ucxx migration, was it possible to use "ucxx" without dask-cuda?

Contributor

Correct: https://github.com/dask/distributed/blob/main/distributed/comm/ucx.py

We'll need to deprecate / remove that once the dust settles here.

Member Author

Yes, it's always been possible by specifying `protocol="ucxx"` if you have distributed-ucxx installed, though I don't think I've ever heard of anybody actually doing that. I therefore decided to take a stab at making this change for RAPIDS users only. The non-RAPIDS community will still need to switch (if anybody is actually using UCX-Py), but they will not get a warning; at some point we'll simply remove the UCX backend from Distributed and require users to install distributed-ucxx.

@TomAugspurger (Contributor) left a comment


Can / should we add distributed-ucxx as a required dependency here? That would remove the "I have ucx-py but not ucxx" scenario.


@pentschev (Member Author)

> Can / should we add distributed-ucxx as a required dependency here? That would remove the "I have ucx-py but not ucxx" scenario.

I think not, because UCX (UCX-Py included) was/is not a required dependency for Dask(-CUDA) or RAPIDS, so adding distributed-ucxx as a dependency here would change that. Maybe that would be OK, but I think it warrants a separate discussion. Also, if we did this, the warning would never be raised for existing users, making the switch immediately transparent for everyone, and I'd like to give a heads-up for now.

rapids-bot bot pushed a commit to rapidsai/dask-cuda that referenced this pull request Jul 3, 2025
…1516)

UCXX is becoming the default UCX communicator via a proxy backend with rapidsai/rapids-dask-dependency#116. This proxy backend now points `protocol="ucx"` to UCXX provided distributed-ucxx is installed; otherwise it falls back to UCX-Py and emits a warning. While UCX-Py is not yet archived, a `protocol="ucx-old"` fallback is also provided. This change ensures the proper backend (UCXX or UCX-Py) is initialized based on the specified protocol and whether distributed-ucxx is installed.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #1516
rapids-bot bot pushed a commit to rapidsai/ucxx that referenced this pull request Jul 3, 2025
Switch the default distributed-ucxx progress mode to `"blocking"`, which is equivalent to UCX-Py's default progress mode, allowing a more transparent move of Dask to UCXX. This change may later be reverted once we have more confidence in a successful transition to UCXX.

Additionally allow both `"ucx://"` and `"ucxx://"` address prefixes, since with the change from rapidsai/rapids-dask-dependency#116 both will be valid to enable UCXX when distributed-ucxx is installed.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Tom Augspurger (https://github.com/TomAugspurger)

URL: #453
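The dual-prefix support described in the commit above amounts to a small normalization step when parsing addresses. A hedged sketch, not the actual UCXX code; the helper name is hypothetical:

```python
def normalize_ucx_address(address: str) -> str:
    """Strip either accepted prefix, returning the bare host:port.

    Both "ucx://" and "ucxx://" are treated as equivalent, matching
    the behavior described in the commit message.
    """
    for prefix in ("ucxx://", "ucx://"):
        if address.startswith(prefix):
            return address[len(prefix):]
    raise ValueError(f"not a UCX address: {address!r}")
```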
@pentschev (Member Author)

I've added tests to this, and the tests depend on Dask-CUDA and NVML. Only then did I notice errors:

 │   pynvml.NVMLError_LibraryNotFound: NVML Shared Library Not Found

This seems to be because the runners are CPU-only. However, both conda and wheel build jobs are supposed to run on CUDA nodes, and the containers pulled for both are CUDA containers as well. Any ideas why the runners being selected are CPU-only @jameslamb @vyasr ?
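One way to keep such tests from erroring on CPU-only runners is to probe NVML up front and skip when it is unavailable. A minimal sketch; the helper name is hypothetical:

```python
def has_nvml() -> bool:
    """Return True only when pynvml is importable and NVML initializes.

    On CPU-only runners nvmlInit() typically raises
    NVMLError_LibraryNotFound, which we catch via the NVMLError base.
    """
    try:
        import pynvml
    except ImportError:
        return False
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return False
    pynvml.nvmlShutdown()
    return True
```

Tests could then be decorated with something like `pytest.mark.skipif(not has_nvml(), reason="NVML unavailable")` rather than failing outright.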

@pentschev pentschev marked this pull request as ready for review July 3, 2025 18:56
@pentschev pentschev requested review from a team as code owners July 3, 2025 18:56
@pentschev pentschev requested a review from gforsyth July 3, 2025 18:56
@vyasr (Contributor) commented Jul 3, 2025

Builds do not run on GPU nodes, only tests do. The links you're pointing to are the versions of CUDA installed in the container because you need the CTK to build CUDA-enabled libraries, but we always build on nodes without CUDA installed on the host.

@pentschev (Member Author)

> Builds do not run on GPU nodes, only tests do. The links you're pointing to are the versions of CUDA installed in the container because you need the CTK to build CUDA-enabled libraries, but we always build on nodes without CUDA installed on the host.

Argh. Is it possible to enable builds on CUDA nodes here? Otherwise I think it may be too much trouble to get the tests working here, and I would probably move them to a different repo (probably Dask-CUDA), but this PR would need to go in first anyway.

@vyasr (Contributor) commented Jul 3, 2025

It is possible, but it requires a bit more work that I don't know if you want to do. All our workflows in RAPIDS reuse the same shared workflows; you can see that in this repo's conda build job. The type of node that each job runs on is therefore also governed by the shared workflow. The options you have would be (in order of increasing complexity):

  1. Make the build workflow runners parametrizable so that you could request a GPU node in this case
  2. Switch the jobs here to use the test workflows, which do run on GPU nodes, but that would require you to manually recreate logic for uploading etc.
  3. Use a completely manually rewritten workflow.

Practically speaking my guess is that you'd only care enough to do the work for option 1. The main question is whether the build team will support having that option. The other options seem like a lot more work.

@pentschev (Member Author)

Thanks for the detailed response, Vyas! Indeed, I don't care enough to spend several more hours on this, nor to potentially have to convince the build team to support that. In that case I'd prefer to stay agile: merge this as is and add the tests to Dask-CUDA afterwards. I tested all cases multiple times, including in the build containers locally with the complete scripts I previously wrote for this PR, and everything passed, which gives me enough confidence that this solution should suffice. WDYT?

@vyasr (Contributor) left a comment


I agree that it's not worth the effort. You've done enough testing and I'd rather we move forward. The changes look fine to me, so approving.

@rjzamora (Member) left a comment


One question, but this seems good to me. Thanks for moving forward with this, Peter!


backends["ucx"] = UCXBackend()
class UCXConnectorOld(UCXConnector):
prefix = "ucx-old://"
Member

Why are we using a different prefix for UCX-Py? Is this for debugging purposes (so we can easily detect from the URLs that UCX-Py is being used)?

@pentschev (Member Author) Jul 3, 2025

The original prefix `ucx` intentionally raises the warning if distributed-ucxx is not installed but allows continuing to use UCX-Py; the `ucx-old` prefix is only there to allow users to silence the warning, see https://github.com/rapidsai/rapids-dask-dependency/pull/116/files#diff-75fb79bf88eb7a0e127ec89ae51605efc72a6a829420b95c7463adaef1736878R574-R576.
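The relationship between the two prefixes can be sketched as follows. This is simplified: the `UCXConnector` base here is a stand-in for the real Distributed class, and the warning text is illustrative.

```python
import warnings


class UCXConnector:
    """Stand-in for Distributed's UCX connector."""

    prefix = "ucx://"

    def __init__(self):
        # protocol="ucx" warns when distributed-ucxx is missing...
        warnings.warn(
            "UCX-Py is deprecated; install distributed-ucxx, "
            'or use protocol="ucx-old" to silence this warning.',
            FutureWarning,
        )


class UCXConnectorOld(UCXConnector):
    prefix = "ucx-old://"

    def __init__(self):
        # ...while protocol="ucx-old" opts in to UCX-Py silently.
        pass
```

So the two prefixes share the same UCX-Py transport; the only behavioral difference is whether the deprecation warning fires.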

@pentschev (Member Author)

Thanks all for the reviews! I'll go ahead and merge this now and watch the results from nightlies tomorrow morning.

@pentschev (Member Author)

/merge

@rapids-bot rapids-bot bot merged commit 88d91b6 into rapidsai:branch-25.08 Jul 3, 2025
10 checks passed
rapids-bot bot pushed a commit to rapidsai/dask-cuda that referenced this pull request Jul 4, 2025
With changes from rapidsai/rapids-dask-dependency#116, `protocol="ucx"` now uses UCXX when `distributed-ucxx` is installed, as is the case in CI. Some tests broke because of the conditionals that validate UCX-Py vs UCXX; this change fixes those currently failing in CI, and the remaining tests will be addressed in a follow-up PR.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #1518
copy-pr-bot bot pushed a commit to rapidsai/cuml that referenced this pull request Jul 15, 2025
rapidsai/rapids-dask-dependency#116 changes the meaning of "ucx". To restore the old behavior, we use "ucx-old".

ucx-py, which the `ucx-old` protocol uses, is being deprecated in favor of `ucxx` (rapidsai/ucx-py#1151), so we could also consider removing this job / tests.
rapids-bot bot pushed a commit to rapidsai/raft that referenced this pull request Jul 23, 2025
rapidsai/rapids-dask-dependency#116 changes the meaning of `protocol="ucx"`. To restore the old behavior, we use `"ucx-old"`.

UCX-Py, which the protocol `"ucx-old"` uses, is being deprecated in favor of UCXX (rapidsai/ucx-py#1151) and is scheduled for removal in 25.10, allowing us to then remove UCX-Py tests entirely.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Bradley Dice (https://github.com/bdice)

URL: #2743
rapids-bot bot pushed a commit to rapidsai/cuml that referenced this pull request Jul 23, 2025
rapidsai/rapids-dask-dependency#116 changes the meaning of "ucx". To restore the old behavior, we use "ucx-old".

ucx-py, which the protocol `ucx-old` uses, is being deprecated in favor of `ucxx` (rapidsai/ucx-py#1151) so we could also consider removing this job / tests.

Authors:
  - Tom Augspurger (https://github.com/TomAugspurger)

Approvers:
  - Simon Adorf (https://github.com/csadorf)
  - Bradley Dice (https://github.com/bdice)

URL: #7008
@pentschev pentschev deleted the ucxx-proxy branch September 1, 2025 10:10
rapids-bot bot pushed a commit that referenced this pull request Sep 4, 2025
Rewrite the UCXX/UCX-Py proxy backends added in #116 to deprecate UCX-Py. We have already removed the UCX-Py backend from Distributed in dask/distributed#9105, but until there is a release we cannot rely on that, so the changes in this PR work the same way we expect once the migration is completed and Distributed has a new release. If distributed-ucxx is not installed, a warning and exception are raised telling the user to install it.

The state after this PR is as follows:

1. If distributed-ucxx is installed:
    1. `protocol="ucx"` / `protocol="ucxx"` should continue working as before, both were already using distributed-ucxx
    2. `protocol="ucx-old"` (only exists in rapids-dask-dependency added in #116) will raise the same `FutureWarning` as `protocol="ucx"` without RAPIDS
2. If distributed-ucxx is NOT installed:
    1. `protocol="ucx"` / `protocol="ucxx"` / `protocol="ucx-old"` all raise a `FutureWarning` and don’t let users create processes.
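The two states above could be expressed as a small dispatch. A sketch only: the function name is hypothetical, the messages are illustrative, and `ucxx_installed` is passed in as a parameter rather than probed.

```python
import warnings


def resolve_protocol(protocol: str, ucxx_installed: bool) -> str:
    """Map a requested protocol to a backend per the states above."""
    if protocol not in ("ucx", "ucxx", "ucx-old"):
        raise ValueError(f"unknown protocol: {protocol!r}")
    if not ucxx_installed:
        # All three protocols warn and refuse to create processes.
        warnings.warn("install distributed-ucxx", FutureWarning)
        raise RuntimeError("distributed-ucxx is required")
    if protocol == "ucx-old":
        # ucx-old still works but carries the deprecation warning.
        warnings.warn("UCX-Py is deprecated; use protocol='ucx'", FutureWarning)
    return "distributed-ucxx"
```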

The final state is expected to be achieved when a new Distributed release is cut and the `__rdd_patch_ucx.py` file is removed from rapids-dask-dependency, which will be:

1. If distributed-ucxx is installed:
    1. `protocol="ucx"` (now transparently uses distributed-ucxx) / `protocol="ucxx"` (always used distributed-ucxx) should continue working as before
2. If distributed-ucxx is NOT installed:
    1. `protocol="ucx"` raises a `FutureWarning` and doesn’t let users create processes
    2. `protocol="ucxx"` never existed before distributed-ucxx so it will just fail with a wrong protocol name

Specifically, I've tested the following to confirm that the behavior matches what is described above as the state of this PR (using all `ucx`/`ucxx`/`ucx-old` combinations):

1. `dask scheduler --protocol=ucx`
2. `dask worker ucx://127.0.0.1:8786`
3. `LocalCluster(protocol="ucx", silence_logs=False)`
4. `LocalCUDACluster(protocol="ucx", silence_logs=False)`

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Tom Augspurger (https://github.com/TomAugspurger)

URL: #120

Labels

breaking (Introduces a breaking change), feature request (New feature or request)

4 participants