Fix mm accuracy in ROCm for some inputs by lcskrishna · Pull Request #116537 · pytorch/pytorch

lcskrishna · 2023-12-29T14:17:57Z

This PR fixes accuracy issues in the hipBLASLt path for the mm case on ROCm.
It is a follow-up to the integration PRs #114329 and #114890.

The accuracy issue arises in the mm use case on ROCm when hipBLASLt is enabled: a bias is passed even though mm does not require one. This PR addresses that issue.
Added a unit test for this case (bias=None).
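
As a rough illustration of the failure mode described above, here is a pure-Python toy model of a GEMM with an optional bias epilogue. It is not the hipBLASLt API: `gemm_with_epilogue` and its parameters are hypothetical names, and the sketch assumes the epilogue adds one bias value per output column whenever a bias is supplied. Under that assumption, an mm call that supplies an unneeded bias differs from one that passes `bias=None`.

```python
def gemm_with_epilogue(a, b, bias=None):
    """Toy model (hypothetical): return a @ b, adding bias[j] to column j if bias is given."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]
    if bias is not None:
        # Assumed epilogue behavior: the bias is applied whenever one is supplied.
        out = [[out[i][j] + bias[j] for j in range(cols)] for i in range(rows)]
    return out


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
expected = [[19.0, 22.0], [43.0, 50.0]]  # plain matrix product a @ b

# mm semantics: no bias is involved, so none is passed.
mm_result = gemm_with_epilogue(a, b, bias=None)

# The problematic pattern: a buffer that mm never asked for is passed as bias.
stale_buffer = [0.5, -0.25]
polluted = gemm_with_epilogue(a, b, bias=stale_buffer)

print(mm_result == expected)  # plain product matches the reference
print(polluted == expected)   # the unneeded bias changes the result
```

In this model, only the `bias=None` call reproduces the plain product, which is the case the added unit test is described as covering.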

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang

pytorch-bot · 2023-12-29T14:18:01Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/116537

✅ You can merge normally! (8 Unrelated Failures)

As of commit ba7fafd with merge base 57491d2:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

periodic / linux-focal-rocm5.7-py3.8 / test (distributed, 1, 2, linux.rocm.gpu) (gh)
distributed/test_c10d_functional_native.py::C10DFunctionalNativeTest::test_inductor_all_gather_into_tensor_coalesced
periodic / linux-focal-rocm5.7-py3.8 / test (distributed, 2, 2, linux.rocm.gpu) (gh)
distributed/test_functional_api.py::TestNCCLCollectivesWithWorldSize4::test_tracing_with_fakepg

This comment was automatically generated by Dr. CI and updates every 15 minutes.

lcskrishna · 2023-12-29T18:16:23Z

@pytorchbot label ciflow/periodic

lcskrishna · 2023-12-29T18:16:39Z

@pytorchbot label ciflow/trunk

aten/src/ATen/native/cuda/Blas.cpp

aten/src/ATen/cuda/CUDABlas.cpp

albanD

A more detailed PR description and a test case in OpInfo that triggers this failure would be great.

aten/src/ATen/cuda/CUDABlas.cpp

malfet

LGTM, but see comments about some minor issues (but as tests are limited to ROCm platform I'm not requesting changes)

test/test_linalg.py

jeffdaily · 2024-01-09T18:17:55Z

@pytorchbot merge

jeffdaily · 2024-01-09T23:28:12Z

@pytorchbot merge

pytorchmergebot · 2024-01-09T23:30:18Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-01-09T23:30:40Z

Merge failed

Reason: 6 jobs have failed, first few of them are: rocm, linux-binary-libtorch-pre-cxx11, linux-binary-manywheel, trunk, linux-binary-libtorch-cxx11-abi

Details for Dev Infra team

Raised by workflow job

facebook-github-bot · 2024-01-10T00:00:53Z

@xw285cornell has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

xw285cornell · 2024-01-10T00:27:02Z

Any chance we can change the hipblaslt behavior to avoid cuda/hip divergence?

jeffdaily · 2024-01-10T00:30:41Z

> Any chance we can change the hipblaslt behavior to avoid cuda/hip divergence?

Which part of it? The bias issue or the double type not being supported or both?

xw285cornell · 2024-01-10T00:40:28Z

aten/src/ATen/native/cuda/Blas.cpp

+#if defined(USE_ROCM)
+              // This condition is needed for mm case on ROCm for hipblasLt path.
+              // Passing the bias ptr as null to avoid accuracy issues for mm case.
+              (&result != &self) ? self.const_data_ptr<scalar_t>() : nullptr,


@jeffdaily I meant this part, which passes bias == nullptr to avoid setting the epilogue attributes in computeDesc, iiuc. I wonder why setting those epilogue attributes would end up producing wrong results.
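
The conditional in the quoted hunk can be read as an aliasing check. A minimal Python sketch of that logic follows, with hypothetical names (`select_bias` is not a PyTorch function) and object identity standing in for the C++ address comparison `&result != &self`. It assumes, as the code comment suggests, that the mm path reaches this code with `self` aliasing `result`.

```python
def select_bias(result, self_):
    """Hypothetical helper: return self_ as the bias, or None when it merely aliases result."""
    # Identity (not equality) mirrors comparing the addresses of the two tensors.
    return self_ if result is not self_ else None


out = [0.0, 0.0]
bias = [1.0, 2.0]

mm_style = select_bias(out, out)      # self aliases result: no real bias
addmm_style = select_bias(out, bias)  # distinct bias buffer: keep it

print(mm_style)
print(addmm_style)
```

Under these assumptions, only the addmm-style call forwards a bias, so the epilogue attributes would be set only when a real bias exists.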

jeffdaily · 2024-01-10T17:11:15Z

@pytorchbot merge

pytorchmergebot · 2024-01-10T17:13:18Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-01-10T17:13:38Z

Merge failed

Reason: 2 jobs have failed, first few of them are: rocm, periodic

Details for Dev Infra team

Raised by workflow job

jeffdaily · 2024-01-10T17:24:25Z

pytorchmergebot got confused. I had to remove and re-add ciflow labels. Hopefully all missing ciflows are available to mergebot now.

jeffdaily · 2024-01-10T22:10:47Z

@pytorchbot merge -f "unrelated failures"

pytorchmergebot · 2024-01-10T22:12:36Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

fixes for bias case in hipblaslt for mm api on ROCm

e51604d

pytorch-bot bot added the ciflow/rocm (Trigger "default" config CI on ROCm) and module: rocm (AMD GPU support for Pytorch) labels Dec 29, 2023

pytorchbot added the open source label Dec 29, 2023

pytorch-bot bot added the ciflow/periodic label (Trigger jobs ran periodically on master (periodic.yml) on the PR) Dec 29, 2023

pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) Dec 29, 2023

cpuhrsch reviewed Jan 3, 2024

aten/src/ATen/native/cuda/Blas.cpp

cpuhrsch reviewed Jan 3, 2024

aten/src/ATen/cuda/CUDABlas.cpp (outdated)

cpuhrsch added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jan 3, 2024

albanD reviewed Jan 3, 2024

aten/src/ATen/cuda/CUDABlas.cpp (outdated)

jeffdaily self-requested a review January 3, 2024 18:30

lcskrishna added 3 commits January 8, 2024 04:33

Merge branch 'main' of https://github.com/pytorch/pytorch into cl/fix…

df847cd

…-hipblaslt

update code based on review comments

ec1054f

added unit test and made the code more cleaner

60f3d56

lcskrishna requested review from IvanYashchuk, lezcano and nikitaved as code owners January 9, 2024 05:08

pytorch-bot bot added the release notes: linalg_frontend label (release notes category) Jan 9, 2024

lezcano removed their request for review January 9, 2024 07:53

skip hipblaslt tests on cuda

6c046b5

jeffdaily approved these changes Jan 9, 2024

malfet approved these changes Jan 9, 2024

test/test_linalg.py (outdated)

test/test_linalg.py (outdated)

test/test_linalg.py (outdated)

test/test_linalg.py (outdated)

malfet reviewed Jan 9, 2024

test/test_linalg.py (outdated)

malfet and others added 2 commits January 9, 2024 09:04

Apply suggestions from code review

87458b2

address comments in unittest for hipblaslt

9933f8c

pytorchmergebot added the merging label Jan 9, 2024

pytorchmergebot removed the merging label Jan 9, 2024

jeffdaily added and removed the ciflow/trunk label (Trigger trunk jobs on your pull request) Jan 9, 2024

xw285cornell reviewed Jan 10, 2024

pytorchmergebot added the merging label Jan 10, 2024

pytorchmergebot removed the merging label Jan 10, 2024

pytorchmergebot added the merging label Jan 10, 2024

pytorchmergebot closed this in b9293e7 Jan 10, 2024

pytorchmergebot added Merged and removed merging labels Jan 10, 2024

lezcano changed the title ~~[ROCm] Fixes for hipblasLt for mm use case.~~ Fix mm accuracy in ROCm for some inputs Mar 29, 2024
