[core][rdt] Enable nixl for RDT Microbenchmarks #59291

stephanie-wang merged 11 commits into ray-project:master
Conversation
Signed-off-by: dayshah <dhyey2019@gmail.com>
Code Review
This pull request enables nixl for the RDT microbenchmarks and fixes a boolean logic bug. The changes look good. I've added a couple of suggestions to improve code readability and maintainability in the Python script and the YAML configuration file.
```yaml
run:
  timeout: 1800
  # NIXL currently only works on T4's for <100mb tensors so not enabling.
```
Just for my understanding, why doesn't NIXL work with non-T4 GPUs and sub-100MB tensors? Is it slow, or will it throw an error?
tbh I have no idea... it'll throw an error on memory registration. Going to ask the nixl ppl:
```
File "/opt/conda/envs/ray-dev/lib/python3.10/site-packages/nixl_cu12/_api.py", line 384, in register_memory
    self.agent.registerMem(reg_descs, handle_list)
nixl_cu12._bindings.nixlBackendError: NIXL_ERR_BACKEND
(GPUActor pid=290793) E1212 05:43:54.376948 291056 nixl_agent.cpp:487] registerMem: registration failed for the specified or all potential backends
(GPUActor pid=290793) [1765518234.376927] [ip-172-31-19-238:290793:0] gdr_copy_md.c:151 UCX ERROR gdr_pin_buffer failed. length :104857600 ret:12
(GPUActor pid=290793) [1765518234.376942] [ip-172-31-19-238:290793:0] ucp_mm.c:81 UCX ERROR failed to register address 0x7f610c000000 (cuda) length 104857600 on md[6]=gdr_copy: Input/output error (md supports: cuda)
```
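As a sanity check on the log above, the buffer length that `gdr_pin_buffer` failed to pin is exactly 100 MiB, which lines up with the ~100 MB tensor-size threshold mentioned in the release-test YAML comment:

```python
# The UCX error reports a failed registration of length 104857600 bytes.
length_bytes = 104857600

# 104857600 bytes is exactly 100 MiB.
mib = length_bytes / (1024 * 1024)
print(mib)  # 100.0

# For a float32 tensor (4 bytes per element), that is 26,214,400 elements.
print(length_bytes // 4)  # 26214400
```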
```python
    "--enable_nixl",
    action="store_true",
)
parser.add_argument(
```
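For context on the flag being added in this hunk: `--enable_nixl` is a `store_true` argparse flag, so it defaults to `False` and flips to `True` only when passed. A minimal self-contained sketch (the `--enable_torch` companion flag is hypothetical, for illustration of the same pattern):

```python
import argparse

# Sketch of the store_true pattern used by the benchmark script:
# the flag is False unless explicitly passed on the command line.
parser = argparse.ArgumentParser()
parser.add_argument("--enable_nixl", action="store_true")
parser.add_argument("--enable_torch", action="store_true")  # hypothetical companion flag

defaults = parser.parse_args([])
print(defaults.enable_nixl)  # False

args = parser.parse_args(["--enable_nixl"])
print(args.enable_nixl)  # True
```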
Just for my understanding, the torch benchmark is the baseline, right, and we don't run it in CI at all? Do you think it's worth running to compare the timings against the other transports?
Also, since the torch transport can only be used with the torch test funcs, it might be cleaner to pull the torch stuff out of the Actor and make a second Actor that's just a "baseline" or "benchmark" actor. But not part of your PR.
Hmm ya, I guess it doesn't hurt to run the torch bench on the weekly. I was thinking it doesn't really change, but it actually is subject to Ray changes.
And ya, I can move it to a diff actor in the PR where I enable the torch weekly baseline too.
```yaml
- name: rdt_single_node_B200_microbenchmark
  python: "3.10"
- name: rdt_single_node_A100_microbenchmark
  python: "3.11"
```
why are we bumping the python version?
The LLM images for release tests are only 3.11 😢
Description
Enabling nixl for the new RDT microbenchmarks. To get the expected performance out of nixl, we need to set

```
- UCX_TLS=all
- UCX_NET_DEVICES=all
```

Also fixing a boolean bug I had in the microbenchmark script for when you enable the torch tests.
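As a hedged sketch of how those two UCX settings take effect: they are plain environment variables that must be set in the benchmark process's environment before UCX initializes (the release test sets them in the cluster environment, not in Python; this snippet only illustrates the variables themselves):

```python
import os

# UCX_TLS=all lets UCX choose among all available transports;
# UCX_NET_DEVICES=all lets it consider every network device.
# These must be in the environment before NIXL/UCX is initialized.
os.environ["UCX_TLS"] = "all"
os.environ["UCX_NET_DEVICES"] = "all"

print(os.environ["UCX_TLS"], os.environ["UCX_NET_DEVICES"])  # all all
```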