[core][rdt] Enable nixl for RDT Microbenchmarks #59291

stephanie-wang merged 11 commits into ray-project:master
Conversation
Signed-off-by: dayshah <dhyey2019@gmail.com>
Code Review
This pull request enables nixl for the RDT microbenchmarks and fixes a boolean logic bug. The changes look good. I've added a couple of suggestions to improve code readability and maintainability in the Python script and the YAML configuration file.
```yaml
run:
  timeout: 1800
  # NIXL currently only works on T4's for <100mb tensors so not enabling.
```
Just for my understanding, why doesn't NIXL work with non-T4 GPUs and sub-100MB tensors? Is it slow, or will it throw an error?
tbh I have no idea... it'll throw an error on memory registration. Going to ask the nixl ppl:
```
File "/opt/conda/envs/ray-dev/lib/python3.10/site-packages/nixl_cu12/_api.py", line 384, in register_memory
    self.agent.registerMem(reg_descs, handle_list)
nixl_cu12._bindings.nixlBackendError: NIXL_ERR_BACKEND
(GPUActor pid=290793) E1212 05:43:54.376948 291056 nixl_agent.cpp:487] registerMem: registration failed for the specified or all potential backends
(GPUActor pid=290793) [1765518234.376927] [ip-172-31-19-238:290793:0] gdr_copy_md.c:151 UCX ERROR gdr_pin_buffer failed. length :104857600 ret:12
(GPUActor pid=290793) [1765518234.376942] [ip-172-31-19-238:290793:0] ucp_mm.c:81 UCX ERROR failed to register address 0x7f610c000000 (cuda) length 104857600 on md[6]=gdr_copy: Input/output error (md supports: cuda)
```
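As a sanity check on the log above, the buffer length that `gdr_pin_buffer` failed to pin is exactly 100 MiB, which lines up with the ~100 MB tensor-size threshold mentioned in the release-test YAML comment:

```python
# The UCX error reports a failed registration of length 104857600 bytes.
length_bytes = 104857600

# 104857600 bytes is exactly 100 MiB.
mib = length_bytes / (1024 * 1024)
print(mib)  # 100.0

# For a float32 tensor (4 bytes per element), that is 26,214,400 elements.
print(length_bytes // 4)  # 26214400
```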
```python
    "--enable_nixl",
    action="store_true",
)
parser.add_argument(
```
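For context on the flag being added in this hunk: `--enable_nixl` is a `store_true` argparse flag, so it defaults to `False` and flips to `True` only when passed. A minimal self-contained sketch (the `--enable_torch` companion flag is hypothetical, for illustration of the same pattern):

```python
import argparse

# Sketch of the store_true pattern used by the benchmark script:
# the flag is False unless explicitly passed on the command line.
parser = argparse.ArgumentParser()
parser.add_argument("--enable_nixl", action="store_true")
parser.add_argument("--enable_torch", action="store_true")  # hypothetical companion flag

defaults = parser.parse_args([])
print(defaults.enable_nixl)  # False

args = parser.parse_args(["--enable_nixl"])
print(args.enable_nixl)  # True
```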
Just for my understanding, the torch benchmark is the baseline, right, and we don't run it in CI at all? Do you think it's worth running to compare the timings against the other transports?
Also, since the torch transport can only be used with the torch test funcs, it might be cleaner to pull the torch stuff out of the Actor and make a second Actor that's just a "baseline" or "benchmark" actor. But not part of your PR.
Hmm ya, I guess it doesn't hurt to run the torch bench on the weekly. I was thinking it doesn't really change, but it actually is subject to Ray changes.
And ya, I can move it to a diff actor in the PR where I enable the torch weekly baseline too.
```yaml
- name: rdt_single_node_B200_microbenchmark
  python: "3.10"
- name: rdt_single_node_A100_microbenchmark
  python: "3.11"
```
why are we bumping the python version?
The LLM images for release tests are only 3.11 😢
Description
Enabling nixl for the new RDT microbenchmarks. To get the expected performance out of nixl, we need to set

```
- UCX_TLS=all
- UCX_NET_DEVICES=all
```

Also fixing a boolean bug I had in the microbenchmark script for when you enable the torch tests.
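As a hedged sketch of how those two UCX settings take effect: they are plain environment variables that must be set in the benchmark process's environment before UCX initializes (the release test sets them in the cluster environment, not in Python; this snippet only illustrates the variables themselves):

```python
import os

# UCX_TLS=all lets UCX choose among all available transports;
# UCX_NET_DEVICES=all lets it consider every network device.
# These must be in the environment before NIXL/UCX is initialized.
os.environ["UCX_TLS"] = "all"
os.environ["UCX_NET_DEVICES"] = "all"

print(os.environ["UCX_TLS"], os.environ["UCX_NET_DEVICES"])  # all all
```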