[core][rdt] Enable nixl for RDT Microbenchmarks#59291

Merged
stephanie-wang merged 11 commits into ray-project:master from dayshah:rdt-nixl-microbench
Dec 16, 2025

Conversation

@dayshah
Contributor

@dayshah dayshah commented Dec 9, 2025

Description

Enables NIXL for the new RDT microbenchmarks. To get the expected performance out of NIXL we need to set:

- UCX_TLS=all
- UCX_NET_DEVICES=all

Also fixes a boolean bug in the microbenchmark script that was triggered when the torch tests are enabled.
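The two UCX variables can be set from Python before the NIXL/UCX backend initializes, since UCX reads its environment at startup. A minimal sketch (the helper name is made up for illustration and is not part of this PR):

```python
import os

def configure_ucx_for_nixl(env=os.environ):
    # Hypothetical helper: apply the UCX settings the PR description says
    # NIXL needs for expected performance. UCX reads these variables when
    # the transport layer starts, so set them before importing/using nixl.
    env["UCX_TLS"] = "all"          # allow every available UCX transport
    env["UCX_NET_DEVICES"] = "all"  # allow every available network device
    return env

configure_ucx_for_nixl()
```

In a release-test setting these would more likely be exported in the cluster env or a setup script, but the effect is the same as long as they are in place before the agent starts.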

Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah added the go add ONLY when ready to merge, run all tests label Dec 9, 2025
@dayshah dayshah changed the title from [core][rdt] to [core][rdt] Enable nixl for RDT Microbenchmarks Dec 9, 2025
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request enables nixl for the RDT microbenchmarks and fixes a boolean logic bug. The changes look good. I've added a couple of suggestions to improve code readability and maintainability in the Python script and the YAML configuration file.

Comment thread release/release_tests.yaml Outdated
Signed-off-by: dayshah <dhyey2019@gmail.com>
@ray-gardener ray-gardener Bot added the core Issues that should be addressed in Ray Core label Dec 9, 2025
Comment thread release/release_tests.yaml Outdated

```yaml
run:
  timeout: 1800
# NIXL currently only works on T4's for <100mb tensors so not enabling.
```
Contributor

Just for my understanding, why does NIXL only work on T4s with sub-100MB tensors? Is it slow otherwise, or will it throw an error?

Contributor Author

tbh I have no idea... it'll throw an error on memory registration. Going to ask the nixl people:

```
  File "/opt/conda/envs/ray-dev/lib/python3.10/site-packages/nixl_cu12/_api.py", line 384, in register_memory
    self.agent.registerMem(reg_descs, handle_list)
nixl_cu12._bindings.nixlBackendError: NIXL_ERR_BACKEND
(GPUActor pid=290793) E1212 05:43:54.376948  291056 nixl_agent.cpp:487] registerMem: registration failed for the specified or all potential backends
(GPUActor pid=290793) [1765518234.376927] [ip-172-31-19-238:290793:0]     gdr_copy_md.c:151  UCX  ERROR gdr_pin_buffer failed. length :104857600 ret:12
(GPUActor pid=290793) [1765518234.376942] [ip-172-31-19-238:290793:0]          ucp_mm.c:81   UCX  ERROR failed to register address 0x7f610c000000 (cuda) length 104857600 on md[6]=gdr_copy: Input/output error (md supports: cuda)
```
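The gdr_copy failure fires at length 104857600 (exactly 100 MB), matching the yaml comment about NIXL only handling sub-100MB tensors on T4 in these runs. A defensive sketch of a size-based transport guard follows; everything in it (the function, the `object_store` fallback name, the threshold constant) is hypothetical illustration, not code from this PR:

```python
# 104857600 bytes: the exact length in the UCX registration error above.
NIXL_MAX_BYTES = 100 * 1024 * 1024

def choose_transport(shape, dtype_size, nixl_enabled=True):
    """Pick 'nixl' for small tensors and fall back to a hypothetical
    'object_store' path for tensors at or above the failing size."""
    nbytes = dtype_size
    for dim in shape:
        nbytes *= dim
    if nixl_enabled and nbytes < NIXL_MAX_BYTES:
        return "nixl"
    return "object_store"

# A (1024, 1024) float32 tensor is 4 MiB, well under the threshold.
assert choose_transport((1024, 1024), 4) == "nixl"
# 100 MB is exactly where gdr_pin_buffer failed in the log above.
assert choose_transport((104857600,), 1) == "object_store"
```

Whether this kind of guard is the right long-term fix depends on the answer from the NIXL folks; the PR simply leaves NIXL disabled on the affected config.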

Comment thread release/ray_release/byod/byod_nixl_setup.sh Outdated
Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah requested a review from Sparks0219 December 12, 2025 05:47
```python
    "--enable_nixl",
    action="store_true",
)
parser.add_argument(
```
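The PR description mentions fixing a boolean bug in the script. The exact bug isn't shown in this thread, but a classic argparse pitfall worth contrasting with the `action="store_true"` style used in this diff is `type=bool`, which treats any non-empty string (including "False") as true:

```python
import argparse

# Illustrative only: a common argparse boolean bug, not necessarily the one fixed here.
bad = argparse.ArgumentParser()
bad.add_argument("--enable_torch", type=bool, default=False)
# bool("False") is True, because any non-empty string is truthy.
assert bad.parse_args(["--enable_torch", "False"]).enable_torch is True

# The flag style used in this diff avoids the problem entirely:
good = argparse.ArgumentParser()
good.add_argument("--enable_torch", action="store_true")
assert good.parse_args([]).enable_torch is False
assert good.parse_args(["--enable_torch"]).enable_torch is True
```

With `store_true`, the flag's presence alone controls the value, so there is no string-to-bool conversion to get wrong.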
Contributor

Just for my understanding, the torch benchmark is the baseline, right, and we don't run it in CI at all? Do you think it's worth running to compare the timings against the other transports?
Also, since the torch transport can only be used with the torch test funcs, it might be cleaner to pull the torch stuff out of the Actor and make a second "baseline" or "benchmark" actor. But not part of your PR.

Contributor Author

@dayshah dayshah Dec 12, 2025

Hmm, ya, I guess it doesn't hurt to run the torch bench on the weekly. I was thinking it doesn't really change, but it actually is subject to Ray changes.

And ya, I can move it to a diff actor too in the PR where I enable the torch weekly baseline.

```yaml
- name: rdt_single_node_B200_microbenchmark
  python: "3.10"
- name: rdt_single_node_A100_microbenchmark
  python: "3.11"
```
Contributor

why are we bumping the python version?

Contributor Author

The LLM images for release tests are only 3.11 😢

Contributor

@Sparks0219 Sparks0219 left a comment

LGTM 🚢

Signed-off-by: dayshah <dhyey2019@gmail.com>
@stephanie-wang stephanie-wang merged commit 0ddb7ee into ray-project:master Dec 16, 2025
6 checks passed
kriyanshii pushed a commit to kriyanshii/ray that referenced this pull request Dec 16, 2025
cszhu pushed a commit that referenced this pull request Dec 17, 2025
zzchun pushed a commit to zzchun/ray that referenced this pull request Dec 18, 2025
Yicheng-Lu-llll pushed a commit to Yicheng-Lu-llll/ray that referenced this pull request Dec 22, 2025
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

3 participants