[ROCm] enable scaled_gemm #117822

Closed
jeffdaily wants to merge 5 commits into pytorch:main from ROCm:hipblaslt_scaled_gemm

@jeffdaily
Collaborator

jeffdaily commented Jan 19, 2024

Enables scaled_gemm for ROCm using hipBLASLt. As of ROCm 6.0, `HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER` is not supported. As a workaround, the absmax operation is performed on the output buffer, but this results in some loss of accuracy for the absmax result. For this reason, the feature should be considered beta/preview.
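To illustrate why taking the absmax from the output buffer can be less accurate, here is a minimal pure-Python sketch. It assumes (this is not stated in the PR) that the loss comes from the output already having been rounded to a lower-precision dtype before the absmax is taken. Half precision stands in for the real output dtype, the values are made up, and nothing here calls hipBLASLt.

```python
import struct


def round_to_half(x: float) -> float:
    """Round a Python float to IEEE half precision and back.

    Half precision is only a stand-in for a low-precision output dtype.
    """
    return struct.unpack("e", struct.pack("e", x))[0]


def absmax(values):
    """Largest absolute value in an iterable of floats."""
    return max(abs(v) for v in values)


# Hypothetical GEMM results before they are written to the output buffer.
accum = [0.1, -3.1415926, 2.7182818, -0.5]

# What a fused amax would see: the values before rounding.
fused_amax = absmax(accum)

# What the workaround sees: the values after rounding to the output dtype.
stored = [round_to_half(v) for v in accum]
workaround_amax = absmax(stored)

print(fused_amax, workaround_amax, abs(fused_amax - workaround_amax))
```

In this sketch the two results differ by less than one rounding step of the stand-in dtype, which is the kind of small discrepancy the description warns about.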

cc @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang

@pytorch-bot pytorch-bot bot added the ciflow/rocm and module: rocm labels Jan 19, 2024
@pytorch-bot

pytorch-bot bot commented Jan 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/117822

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 9c449fb with merge base fdae936 (image):

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

@jianyuh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jianyuh
Member

jianyuh commented Jan 19, 2024

Rebase to the trunk?

@jeffdaily
Collaborator Author

> Rebase to the trunk?

This PR depends on #115214, which was reverted for breaking the internal build and is in the process of relanding.

@facebook-github-bot
Contributor

@jianyuh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jeffdaily
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict.

@pytorchmergebot
Collaborator

Rebase failed due to Command `git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/117822/head` returned non-zero exit code 1

Rebasing (1/1)
Auto-merging aten/src/ATen/cuda/CUDADataType.h
CONFLICT (content): Merge conflict in aten/src/ATen/cuda/CUDADataType.h
error: could not apply ddc20913ac7... [ROCm] enable scaled_gemm
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply ddc20913ac7... [ROCm] enable scaled_gemm

Raised by https://github.com/pytorch/pytorch/actions/runs/7616139952

@jeffdaily jeffdaily force-pushed the hipblaslt_scaled_gemm branch from ddc2091 to 15662f1 January 22, 2024 19:41
@jeffdaily
Collaborator Author

Manual rebase after #115214 relanded.

@jeffdaily jeffdaily force-pushed the hipblaslt_scaled_gemm branch from 15662f1 to 4da61cb January 22, 2024 23:05
@facebook-github-bot
Contributor

@jianyuh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jianyuh
Member

jianyuh commented Jan 30, 2024

Hi @jeffdaily, could you rebase onto trunk? There is a conflict with master, so we couldn't import it internally and test the PR. Thanks!

@jeffdaily
Collaborator Author

> Hi @jeffdaily, could you rebase onto trunk? There is a conflict with master, so we couldn't import it internally and test the PR. Thanks!

Done.

@facebook-github-bot
Contributor

@jianyuh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jianyuh
Member

jianyuh left a comment

Is this PR ready for public review?

@jeffdaily marked this pull request as ready for review February 9, 2024 23:53
@jeffdaily requested a review from jianyuh February 9, 2024 23:53
@facebook-github-bot
Contributor

@jianyuh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jianyuh
Member

jianyuh commented Feb 26, 2024

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict.

@xw285cornell
Contributor

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team Raised by workflow job

@xw285cornell
Contributor

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing label Feb 29, 2024
@xw285cornell
Contributor

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 4 jobs have failed, first few of them are: trunk, linux-binary-manywheel, linux-binary-libtorch-cxx11-abi, linux-binary-libtorch-pre-cxx11

Details for Dev Infra team Raised by workflow job

@xw285cornell
Contributor

@jeffdaily maybe fix the unit test given the merge has failed :)

@jeffdaily
Collaborator Author

jeffdaily commented Feb 29, 2024

> @jeffdaily maybe fix the unit test given the merge has failed :)

@xw285cornell Pushed commit to resolve the unit tests. You'll need to redo your import to resolve the Meta Internal-Only Changes Check that is now failing due to my new commit push.

@facebook-github-bot
Contributor

@jianyuh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jianyuh
Member

jianyuh commented Feb 29, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / win-vs2019-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral)

Details for Dev Infra team Raised by workflow job

@jianyuh
Member

jianyuh commented Feb 29, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@zou3519
Contributor

zou3519 commented Aug 6, 2025

@jeffdaily is there a recommended way to detect that the rocm gpu is "ROCm MI300+"?

@jeffdaily
Collaborator Author

> @jeffdaily is there a recommended way to detect that the rocm gpu is "ROCm MI300+"?

In torch C++, we have `if (at::detail::getCUDAHooks().isGPUArch({"gfx942"}))`.

In torch Python, we have `if "gfx94" in torch.cuda.get_device_properties(0).gcnArchName.split(":")[0]`.
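For reuse, the Python check can be wrapped in a small helper. This is only a sketch: the function names `is_gfx94` and `current_device_is_gfx94` are hypothetical, and the arch strings in the usage lines are illustrative examples of the `name:feature:feature` form that the `split(":")` implies. The string test is kept separate from the torch call so it can be exercised without a GPU.

```python
def is_gfx94(gcn_arch_name: str) -> bool:
    """Return True if the base arch name (the part before any ':') contains 'gfx94'."""
    return "gfx94" in gcn_arch_name.split(":")[0]


def current_device_is_gfx94(device: int = 0) -> bool:
    """Apply the same check to a live device; returns False on non-ROCm builds."""
    # Imported lazily so is_gfx94 stays usable without torch installed.
    import torch

    if torch.version.hip is None or not torch.cuda.is_available():
        return False
    return is_gfx94(torch.cuda.get_device_properties(device).gcnArchName)


# Illustrative arch strings (assumed format, not taken from the PR).
print(is_gfx94("gfx942:sramecc+:xnack-"))
print(is_gfx94("gfx90a:sramecc+:xnack-"))
```

Note that the substring test only matches names containing `gfx94`, so any other architecture that should count as "MI300+" would need to be added explicitly.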

Labels

ciflow/rocm, ciflow/trunk, Merged, module: rocm, open source, topic: not user facing
