Skip to content

Initialize UCXX when distributed-ucxx installed and protocol=ucx#1516

Merged
rapids-bot[bot] merged 4 commits intorapidsai:branch-25.08from
pentschev:ucxx-proxy
Jul 3, 2025
Merged

Initialize UCXX when distributed-ucxx installed and protocol=ucx#1516
rapids-bot[bot] merged 4 commits intorapidsai:branch-25.08from
pentschev:ucxx-proxy

Conversation

@pentschev
Copy link
Copy Markdown
Member

UCXX is becoming the default UCX communicator via a proxy backend with rapidsai/rapids-dask-dependency#116. This proxy backend now points protocol="ucx" to UCXX provided distributed-ucxx is installed, otherwise fallback to UCX-Py and emit a warning, and while UCX-Py is not archived provide a protocol="ucx-old" as fallback. This change ensures the proper (UCXX or UCX-Py) is initialized based on the specified protocol and whether distributed-ucxx is installed or not.

@pentschev pentschev requested a review from a team as a code owner July 2, 2025 20:59
@pentschev pentschev self-assigned this Jul 2, 2025
@github-actions github-actions bot added the python python code needed label Jul 2, 2025
@pentschev pentschev added 3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change and removed python python code needed labels Jul 2, 2025
Copy link
Copy Markdown
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the logic is not really changing, but I am nonetheless slightly confused

Comment on lines +49 to +55
and not distributed.comm.ucx.cuda_context_created.has_context
):
distributed.comm.ucx._warn_existing_cuda_context(ctx, os.getpid())

_create_cuda_context_handler()

distributed.comm.ucx.init_once()
elif protocol == "ucxx":
import distributed_ucxx.ucxx
if not distributed.comm.ucx.cuda_context_created.has_context:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we got ModuleNotFoundError on line 37, then we arrive here and distributed.comm.ucx does not exist. How does this work?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a tricky case, admittedly I didn't remember myself, great catch! What happens is distributed.comm.ucx is always available so long distributed is installed (which it always is for Dask-CUDA to work). However, ucx-py may not be installed and thus raise ModuleNotFoundError when import ucp fails. With that the distributed.comm.ucx module can be populated with the context information, but it will indeed never come here because protocol="ucx" then becomes invalid. The decision to store context information in distributed.comm.ucx for all cases stems from the comment in lines 31-32, in this manner we prevent duplicate warnings and also allow other protocols such as protocol="tcp" to raise the warning as it used to be the case long ago, and this has since become a bug for protocol="tcp".

I have now fixed the bug for "other protocols" in 66228b4 and added a TODO in c0c44eb, but that will need to be reworked for non-UCX cases after distributed.comm.ucx is removed along with UCX-Py, and filed #1517 to update that later.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

Comment on lines +84 to +88
if (
ctx.has_context
and not distributed_ucxx.ucxx.cuda_context_created.has_context
):
distributed_ucxx.ucxx._warn_existing_cuda_context(ctx, os.getpid())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here, these variables are all unbound if ModuleNotFoundError was caught.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an oversight, we don't need the check for UCXX since distributed_ucxx is required to be installed to come here. Fixed in 82b97d4 .

Comment on lines +109 to +115
if protocol == "ucxx" or (has_ucxx and protocol == "ucx"):
# With https://github.com/rapidsai/rapids-dask-dependency/pull/116,
# `protocol="ucx"` now points to UCXX (if distributed-ucxx is installed),
# thus call the UCXX initializer.
_initialize_ucxx()
else:
_initialize_ucx()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was confused about where the warning is, but it's in rapids-dask-dependency, I think

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's right. I can make the comment more informative if you think it's required, but this should also just be here until 25.10.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is ok.

@github-actions github-actions bot added the python python code needed label Jul 3, 2025
@pentschev
Copy link
Copy Markdown
Member Author

Thanks for the nice catch here @wence- , everything should be addressed now if you could take another look.

@pentschev
Copy link
Copy Markdown
Member Author

Thanks again Lawrence!

@pentschev
Copy link
Copy Markdown
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 64030d0 into rapidsai:branch-25.08 Jul 3, 2025
28 checks passed
rapids-bot bot pushed a commit to rapidsai/rapids-dask-dependency that referenced this pull request Jul 3, 2025
Add a proxy communication backend to enable UCXX when `protocol="ucx"`, overriding UCX-Py. Additionally, provide warnings when UCXX is not installed letting the user know UCX-Py is EOL and will be removed, as well as a proper fallback `protocol="ucx-old"` to continue using UCX-Py while it's not completely archived.

Depends on rapidsai/dask-cuda#1516 and rapidsai/ucxx#453.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Tom Augspurger (https://github.com/TomAugspurger)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Richard (Rick) Zamora (https://github.com/rjzamora)

URL: #116
@pentschev pentschev deleted the ucxx-proxy branch July 4, 2025 08:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change python python code needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants