
Avoid poisoning process with CUDA calls as soon as importing#183

Merged
ipanfilo merged 1 commit into ROCm:dev from HollowMan6:poison
May 14, 2025

Conversation

Contributor

@HollowMan6 HollowMan6 commented May 9, 2025

Description

Let's not call `is_fp8_fnuz` outside any function, as that will call `torch.cuda.get_device_capability` and lazily initialize the CUDA environment:
https://github.com/pytorch/pytorch/blob/c65ee728f069ea9544bdcac815eb0825f45d1633/torch/cuda/__init__.py#L342-L392

The current approach then makes any `CUDA/HIP/ROCR_VISIBLE_DEVICES` changes ineffective after import:
pytorch/pytorch#141678
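
The fix can be sketched as follows. This is a hedged illustration, not the actual TransformerEngine code: `query_device` is a stand-in for `torch.cuda.get_device_capability`, and the call counter only exists to make the evaluation order observable.

```python
# Hedged sketch of the fix, not the actual TransformerEngine code.
# `query_device` stands in for torch.cuda.get_device_capability.
calls = {"n": 0}

def query_device():
    calls["n"] += 1
    return (9, 4)  # pretend device capability

# Before: the check runs as a side effect of merely importing the module,
# which lazily initializes the CUDA/ROCm environment.
IS_FP8_FNUZ_EAGER = query_device() == (9, 4)

# After: wrapped in a lambda, the check runs only when first invoked,
# so importing the module no longer touches the device.
is_fp8_fnuz_lazy = lambda: query_device() == (9, 4)

assert calls["n"] == 1       # only the eager version has queried the "device"
assert is_fp8_fnuz_lazy()    # the deferred query happens here
assert calls["n"] == 2
```

With the lazy form, `*_VISIBLE_DEVICES` can still be changed between import and the first real call.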

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Wrap all module-level `is_fp8_fnuz` checks (previously evaluated at import time) inside lambda functions so the device query is deferred.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Hollow Man <hollowman@opensuse.org>
@ipanfilo ipanfilo merged commit 864405c into ROCm:dev May 14, 2025
5 checks passed
eric-haibin-lin pushed a commit to verl-project/verl that referenced this pull request Jun 2, 2025
…Fix AMD support) (#1465)

### Checklist Before Starting

- [X] Search for similar PR(s).

### What does this PR do?

Add support for RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES, and also fix AMD
support.

### High-Level Design

The current approach for supporting AMD in verl is fundamentally incorrect
and only works by luck:

Calls such as `torch.cuda.is_available()` or
`torch.cuda.get_device_name()` will initialize the CUDA/ROCm
environment:

https://github.com/pytorch/pytorch/blob/c65ee728f069ea9544bdcac815eb0825f45d1633/torch/cuda/__init__.py#L342-L392

Setting CUDA/HIP/ROCR_VISIBLE_DEVICES after CUDA/ROCm is initialized has
no effect (see pytorch/pytorch#141678), which means that the code
currently wrapped inside `[SUPPORT AMD: torch]` is mostly a no-op.
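
The failure mode can be simulated in pure Python (no GPU needed). This hypothetical `FakeCudaRuntime` mimics the one-shot snapshot behavior described above: the environment variable is read exactly once, at first lazy initialization, so later changes are silently ignored.

```python
import os

# Hypothetical simulation of why post-init env changes are no-ops:
# the runtime snapshots *_VISIBLE_DEVICES once, at first initialization.
class FakeCudaRuntime:
    def __init__(self):
        self._visible = None  # not yet initialized

    def _lazy_init(self):
        if self._visible is None:
            # Snapshot taken exactly once, like cuInit() reading the env.
            self._visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")

    def device_count(self):
        self._lazy_init()
        return len([d for d in self._visible.split(",") if d])

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
rt = FakeCudaRuntime()
assert rt.device_count() == 2             # initialization happens here

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # changed after init...
assert rt.device_count() == 2             # ...and silently ignored
```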

CUDA_VISIBLE_DEVICES also works for AMD, but only because a lot of
AMD-migrated software calls `torch.cuda.*` at import time, e.g.:

- ROCm/TransformerEngine#183
- vllm-project/vllm#15246

Meanwhile, ray/vllm manipulate those *_VISIBLE_DEVICES at runtime, so
those `torch.cuda.*` calls poison the current process whenever the
CUDA/ROCm environment is initialized before the manipulation happens.

So a good solution is to use a single, hardware-agnostic environment
variable (`CUDA_VISIBLE_DEVICES`) for everything, mapping all the other
`*_VISIBLE_DEVICES` onto it. Note that we must be careful when both the
HIP/CUDA and ROCR env vars are set, as they have different meanings. Both
accept either a list of ints or a list of UUIDs, but the ROCR env var is
processed first, which reduces the set of GPUs that HIP can then select
from (referring to pytorch/pytorch#144026). To avoid this complexity, we
simply raise an error if both are set (also keeping consistency with
ray's behavior as of 2.45.0).
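
A minimal sketch of that policy, with a hypothetical helper name (not the actual verl function):

```python
import os

# Hypothetical helper mirroring the policy above: refuse to guess when
# both HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES are set, since ROCR
# is applied first and renumbers the devices HIP can then select from.
def check_rocm_visibility_env() -> None:
    if "HIP_VISIBLE_DEVICES" in os.environ and "ROCR_VISIBLE_DEVICES" in os.environ:
        raise ValueError(
            "Both HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES are set; "
            "their semantics differ, so please set only one of them."
        )

# Usage: with both variables set, the helper rejects the configuration.
os.environ["HIP_VISIBLE_DEVICES"] = "0"
os.environ["ROCR_VISIBLE_DEVICES"] = "0"
try:
    check_rocm_visibility_env()
    raised = False
except ValueError:
    raised = True
assert raised
```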

For the poisoning issue, until those 2 PRs are merged, we will need to
ask users to set `RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES` or
`RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES` so that ray no longer
manipulates these variables, and to make verl work when there is no
`*_VISIBLE_DEVICES`.
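
The interim workaround can be sketched as a launch wrapper (the commented launch command is a hypothetical placeholder, not a documented verl entry point):

```shell
# Workaround sketch: keep Ray from rewriting the ROCm visibility
# variables before launching training.
export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1
# python -m verl.trainer.main_ppo ...   # hypothetical launch command
```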

Note that for the latest ray (after its switch to `HIP_VISIBLE_DEVICES`),
we also need this patch: ray-project/ray#52794

### Test

Tested manually on both the Megatron and FSDP backends with vLLM.

### Additional Info.

- **Issue Number**: none
- **Training**: both FSDP and Megatron
- **Inference**: both vLLM and SGLang

### Checklist Before Submitting

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [X] Add `[BREAKING]` to the PR title if it breaks any API.
- [X] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add CI test(s) if necessary.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
yzlnew pushed a commit to yzlnew/verl that referenced this pull request Jun 4, 2025
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jun 6, 2025
wwwjn pushed a commit to wwwjn/verl that referenced this pull request Jun 10, 2025
@HollowMan6 HollowMan6 mentioned this pull request Jun 28, 2025
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
paolo328 added a commit to paolo328/Verl that referenced this pull request Nov 27, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026