[ROCm] Fix ADDMM hipBLASLt regression #138267

Closed
naromero77amd wants to merge 8 commits into pytorch:main from ROCm:fix_addmm_hipblaslt_regression

Conversation

@naromero77amd
Collaborator

@naromero77amd naromero77amd commented Oct 17, 2024

Fixes #138067

This is a partial revert of PR #137604.

The breakage occurs on AMD GPUs that do not fully support hipBLASLt, e.g. gfx1100.
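
As a rough illustration of the failure mode described above, the sketch below picks a GEMM backend from the GPU architecture string and falls back to plain hipBLAS, rather than raising, when the architecture is not in a hipBLASLt-capable set. It is not the code changed by this PR: the function names and the contents of `ASSUMED_HIPBLASLT_ARCHS` are hypothetical, chosen only so that gfx1100 falls outside the set.

```python
# Illustrative sketch only. The set below is an assumption for the example,
# not the list PyTorch actually uses.
ASSUMED_HIPBLASLT_ARCHS = frozenset({"gfx90a", "gfx942"})


def base_arch(gcn_arch_name: str) -> str:
    """Strip feature suffixes such as ':sramecc+:xnack-' from an arch string."""
    return gcn_arch_name.split(":", 1)[0]


def choose_addmm_backend(gcn_arch_name: str,
                         supported=ASSUMED_HIPBLASLT_ARCHS) -> str:
    """Return 'hipblaslt' only for architectures in the supported set.

    Any other architecture (for example gfx1100) falls back to 'hipblas'
    instead of hitting an 'unsupported architecture' error.
    """
    if base_arch(gcn_arch_name) in supported:
        return "hipblaslt"
    return "hipblas"


if __name__ == "__main__":
    for arch in ("gfx1100", "gfx90a:sramecc+:xnack-"):
        print(arch, "->", choose_addmm_backend(arch))
```

The point of the fallback is that an unsupported architecture degrades to the older library instead of failing at runtime.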

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang

@pytorch-bot

pytorch-bot bot commented Oct 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138267

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit aec924b with merge base dbd6ada:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@naromero77amd
Collaborator Author

naromero77amd commented Oct 17, 2024

@pytorchbot label "topic: not user facing"

@pytorch-bot

pytorch-bot bot commented Oct 17, 2024

❌ 🤖 pytorchbot command failed:

Got EOF while in a quoted string
Try `@pytorchbot --help` for more info.

@naromero77amd naromero77amd changed the title Fix ADDMM hipBLASLt regression [ROCm] Fix ADDMM hipBLASLt regression Oct 17, 2024
@pytorch-bot pytorch-bot bot added the ciflow/rocm (Trigger "default" config CI on ROCm) and module: rocm (AMD GPU support for Pytorch) labels Oct 17, 2024
@naromero77amd
Collaborator Author

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing (topic category) label Oct 17, 2024
@drisspg drisspg added the triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module) label Oct 18, 2024
malfet previously approved these changes Oct 20, 2024
Contributor

@malfet malfet left a comment

LGTM

@malfet
Contributor

malfet commented Oct 20, 2024

@pytorchbot merge -f "Lint + ROCM builds are fine"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@jeffdaily
Collaborator

@pytorchbot revert -c nosignal -m "this PR went too far when partially reverting #137604; the env var default should be the same on ROCm and CUDA"
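
To make the revert reason concrete, here is a minimal sketch of a gate where the environment-variable default is identical on ROCm and CUDA, and only the architecture check is platform-specific. The variable name `EXAMPLE_DISABLE_ADDMM_LT` and both helper functions are hypothetical; this thread does not name the actual variable.

```python
import os

# Hypothetical variable name, used for illustration only.
EXAMPLE_VAR = "EXAMPLE_DISABLE_ADDMM_LT"


def lt_disabled_by_env(env=None, var_name=EXAMPLE_VAR) -> bool:
    """Read the opt-out flag. The default ("0", not disabled) is the same
    regardless of whether the build targets ROCm or CUDA."""
    env = os.environ if env is None else env
    return env.get(var_name, "0") == "1"


def use_lt_path(is_rocm: bool, arch_supported: bool, env=None) -> bool:
    """Decide whether to take the Lt (cuBLASLt / hipBLASLt) path."""
    if lt_disabled_by_env(env):
        return False  # the user opted out explicitly
    if is_rocm and not arch_supported:
        return False  # platform-specific guard, separate from the default
    return True


if __name__ == "__main__":
    print(use_lt_path(is_rocm=True, arch_supported=False, env={}))
    print(use_lt_path(is_rocm=True, arch_supported=True, env={}))
    print(use_lt_path(is_rocm=False, arch_supported=True, env={EXAMPLE_VAR: "1"}))
```

Keeping the default in one place and the architecture guard in another means a fix for unsupported GPUs does not change behavior on GPUs that do support the Lt path.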

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Oct 21, 2024
This reverts commit 14a3e12.

Reverted #138267 on behalf of https://github.com/jeffdaily due to this PR went too far when partially reverting #137604; the env var default should be the same on ROCm and CUDA (see the revert comment on #138267)
@pytorchmergebot
Collaborator

@naromero77amd your PR has been successfully reverted.

@pytorch-bot pytorch-bot bot dismissed malfet’s stale review October 21, 2024 19:33

This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.

@naromero77amd naromero77amd marked this pull request as draft October 21, 2024 19:43
@naromero77amd
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased fix_addmm_hipblaslt_regression onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix_addmm_hipblaslt_regression && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the fix_addmm_hipblaslt_regression branch from cb023ce to d2e7c17 Compare October 24, 2024 18:27
@eqy
Collaborator

eqy commented Oct 25, 2024

> Could we check in a repro from #138067 as a test?

> Yes that is what I meant by my earlier comment that I tested it manually.

Sorry, I mean checking in an actual test to safeguard against the same failure happening in the future.

@naromero77amd
Collaborator Author

> Could we check in a repro from #138067 as a test?

> Yes that is what I meant by my earlier comment that I tested it manually.

> Sorry, I mean checking-in an actual test to safeguard the same failure from happening in the future

Thank you for bringing up testing. The right way to test this is with a gfx110x GPU, which we don't have in upstream CI yet, but I have been told it is on the roadmap.

I discussed this with @jeffdaily, and we do need to get rid of one existing test case, as it is no longer appropriate.

@jeffdaily
Collaborator

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Jan 27, 2025
Port the PR pytorch#138267 from upstream
main to fix the error
"RuntimeError: Attempting to use hipBLASLt on a unsupported
architecture!"

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
@naromero77amd naromero77amd deleted the fix_addmm_hipblaslt_regression branch October 29, 2025 22:35

Labels

ciflow/rocm (Trigger "default" config CI on ROCm)
ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
module: rocm (AMD GPU support for Pytorch)
open source
Reverted
topic: not user facing (topic category)
triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Attempting to use hipBLASLt on a unsupported architecture!

7 participants