
Allow ORT actively fallback CUDAExecutionProvider to ROCMExecutionProvider#16895

Closed
cloudhan wants to merge 2 commits into main from guangyunhan/fallback-cuda-to-rocm

Conversation

@cloudhan
Contributor

In the wild, for example, PyTorch and Hugging Face (PyTorch pipelines) use `cuda` for AMD GPUs. Their users can switch from CUDA devices to ROCm devices essentially painlessly. That is, in the PyTorch world, `cuda` falls back to ROCm.

When switching to the Hugging Face ORT backend, the pipeline automatically populates the string `CUDAExecutionProvider` to pin down the provider. ORT then does not play well in this case, because the framework falls back from CUDA to CPU instead.

This disparity creates a lot of headaches when benchmarking the ROCm EP with scripts written for CUDA.

This PR addresses it by allowing CUDA to fall back to ROCm.
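The intended behavior can be sketched roughly as follows: before a session is constructed, an opt-in environment variable rewrites any requested `CUDAExecutionProvider` to `ROCMExecutionProvider`. This is a minimal illustrative sketch, not the PR's exact implementation; the function name is hypothetical, and only the env-var name and provider strings come from the PR.

```python
import os

def maybe_fallback_cuda_to_rocm(providers):
    """Hypothetical sketch: rewrite CUDA EP requests to the ROCm EP
    when the opt-in env var from this PR is set to "1"."""
    if os.environ.get("ORT_FALLBACK_CUDA_EP_TO_ROCM_EP", "0") != "1":
        return providers  # fallback disabled: leave the list untouched
    return [
        "ROCMExecutionProvider" if p == "CUDAExecutionProvider" else p
        for p in providers
    ]
```

With this, a CUDA-oriented benchmark script that passes `["CUDAExecutionProvider", "CPUExecutionProvider"]` would transparently get the ROCm EP on an AMD box once the variable is exported.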

@cloudhan cloudhan force-pushed the guangyunhan/fallback-cuda-to-rocm branch from 6e24e9c to a9cb7ce on July 28, 2023 05:59

def set_provider_options(name, options):
    if (
        os.environ.get("ORT_FALLBACK_CUDA_EP_TO_ROCM_EP", "0") == "1"
Contributor

We've avoided environment variables for configuration so far. Let's continue this convention.

Contributor Author

This is not config; it is a workaround. This fallback logic happens right before the construction of the Session, so the entry ORT_FALLBACK_CUDA_EP_TO_ROCM_EP fits into neither session options nor provider options.

Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>


3 participants