Initialize UCXX when distributed-ucxx installed and protocol=ucx#1516
Initialize UCXX when distributed-ucxx installed and protocol=ucx#1516rapids-bot[bot] merged 4 commits intorapidsai:branch-25.08from
protocol=ucx#1516Conversation
wence-
left a comment
There was a problem hiding this comment.
I think the logic is not really changing, but I am nonetheless slightly confused
| and not distributed.comm.ucx.cuda_context_created.has_context | ||
| ): | ||
| distributed.comm.ucx._warn_existing_cuda_context(ctx, os.getpid()) | ||
|
|
||
| _create_cuda_context_handler() | ||
|
|
||
| distributed.comm.ucx.init_once() | ||
| elif protocol == "ucxx": | ||
| import distributed_ucxx.ucxx | ||
| if not distributed.comm.ucx.cuda_context_created.has_context: |
There was a problem hiding this comment.
If we got ModuleNotFoundError on line 37, then we arrive here and distributed.comm.ucx does not exist. How does this work?
There was a problem hiding this comment.
This is a tricky case, admittedly I didn't remember myself, great catch! What happens is distributed.comm.ucx is always available so long distributed is installed (which it always is for Dask-CUDA to work). However, ucx-py may not be installed and thus raise ModuleNotFoundError when import ucp fails. With that the distributed.comm.ucx module can be populated with the context information, but it will indeed never come here because protocol="ucx" then becomes invalid. The decision to store context information in distributed.comm.ucx for all cases stems from the comment in lines 31-32, in this manner we prevent duplicate warnings and also allow other protocols such as protocol="tcp" to raise the warning as it used to be the case long ago, and this has since become a bug for protocol="tcp".
I have now fixed the bug for "other protocols" in 66228b4 and added a TODO in c0c44eb, but that will need to be reworked for non-UCX cases after distributed.comm.ucx is removed along with UCX-Py, and filed #1517 to update that later.
| if ( | ||
| ctx.has_context | ||
| and not distributed_ucxx.ucxx.cuda_context_created.has_context | ||
| ): | ||
| distributed_ucxx.ucxx._warn_existing_cuda_context(ctx, os.getpid()) |
There was a problem hiding this comment.
Same question here, these variables are all unbound if ModuleNotFoundError was caught.
There was a problem hiding this comment.
This is an oversight, we don't need the check for UCXX since distributed_ucxx is required to be installed to come here. Fixed in 82b97d4 .
dask_cuda/initialize.py
Outdated
| if protocol == "ucxx" or (has_ucxx and protocol == "ucx"): | ||
| # With https://github.com/rapidsai/rapids-dask-dependency/pull/116, | ||
| # `protocol="ucx"` now points to UCXX (if distributed-ucxx is installed), | ||
| # thus call the UCXX initializer. | ||
| _initialize_ucxx() | ||
| else: | ||
| _initialize_ucx() |
There was a problem hiding this comment.
I was confused about where the warning is, but it's in rapids-dask-dependency, I think
There was a problem hiding this comment.
Yes, that's right. I can make the comment more informative if you think it's required, but this should also just be here until 25.10.
|
Thanks for the nice catch here @wence- , everything should be addressed now if you could take another look. |
|
Thanks again Lawrence! |
|
/merge |
Add a proxy communication backend to enable UCXX when `protocol="ucx"`, overriding UCX-Py. Additionally, provide warnings when UCXX is not installed letting the user know UCX-Py is EOL and will be removed, as well as a proper fallback `protocol="ucx-old"` to continue using UCX-Py while it's not completely archived. Depends on rapidsai/dask-cuda#1516 and rapidsai/ucxx#453. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Tom Augspurger (https://github.com/TomAugspurger) - Vyas Ramasubramani (https://github.com/vyasr) - Richard (Rick) Zamora (https://github.com/rjzamora) URL: #116
UCXX is becoming the default UCX communicator via a proxy backend with rapidsai/rapids-dask-dependency#116. This proxy backend now points
protocol="ucx"to UCXX provided distributed-ucxx is installed, otherwise fallback to UCX-Py and emit a warning, and while UCX-Py is not archived provide aprotocol="ucx-old"as fallback. This change ensures the proper (UCXX or UCX-Py) is initialized based on the specified protocol and whether distributed-ucxx is installed or not.