
Enable faster cuBLAS path for torch.linalg.lstsq for batch of small matrices#74434

Closed

IvanYashchuk wants to merge 2 commits into `pytorch:master` from `IvanYashchuk:lstsq-cublas-path`

Conversation

@IvanYashchuk
Collaborator

This PR enables the cuBLAS path for `torch.linalg.lstsq`. Before this PR, only the cuSOLVER path was used for regular PyTorch builds (those built with MAGMA).

Performance results (also previously reported at #54725 (comment)):

|                            | before current PR | current PR | speedup |
|----------------------------|-------------------|------------|---------|
| torch.Size([32, 32, 32])   | 870               | 440        | 2x      |
| torch.Size([64, 32, 32])   | 1340              | 450        | 3x      |
| torch.Size([32, 64, 64])   | 9040              | 1839       | 5x      |
| torch.Size([64, 64, 64])   | 17000             | 1830       | 9.2x    |
| torch.Size([32, 128, 128]) | 23210             | 8560       | 2.7x    |
| torch.Size([64, 128, 128]) | 40000             | 8662       | 4.6x    |
| torch.Size([32, 256, 256]) | 58160             | 46150      | 1.2x    |
| torch.Size([64, 256, 256]) | 73220             | 52080      | 1.4x    |
Times are in microseconds (us).
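For context, the per-batch computation that `torch.linalg.lstsq` performs can be reproduced with a NumPy stand-in (a minimal sketch, assuming NumPy is available; all variable names are illustrative, and the shapes mirror the `torch.Size([batch, n, n])` cases benchmarked above — NumPy's `lstsq` is not batched, so we loop):

```python
import numpy as np

# Illustrative stand-in for what torch.linalg.lstsq computes on a batch
# of small matrices: an independent least-squares solve per batch
# element, min_x || A[i] @ x - b[i] ||_2.
rng = np.random.default_rng(0)
batch, m, n = 4, 32, 32
A = rng.standard_normal((batch, m, n))
b = rng.standard_normal((batch, m, 1))

# NumPy's lstsq is not batched, so solve each system in a loop.
x = np.stack([np.linalg.lstsq(A[i], b[i], rcond=None)[0] for i in range(batch)])

# For square, well-conditioned A the solution satisfies A @ x ≈ b.
residual = np.linalg.norm(A @ x - b)
print(residual)  # tiny (machine-precision scale) for these square systems
```

A batched GPU backend amortizes kernel-launch overhead across exactly this kind of loop, which is consistent with the larger speedups at batch size 64 in the table above.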

@IvanYashchuk added the `module: cuda`, `module: linear algebra`, and `topic: performance` labels on Mar 19, 2022
@IvanYashchuk requested a review from mruberry March 19, 2022 18:04
@pytorch-bot

pytorch-bot Bot commented Mar 19, 2022

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/IvanYashchuk/pytorch/blob/42b868c2bda62af6a5ee71c8c24e9584f2e00565/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/cuda
Add ciflow labels to this PR to trigger more builds:

Triggered Workflows

| Workflow | Labels (bold = enabled) | Status |
|----------|-------------------------|--------|
| deploy-linux-xenial-cuda11.3-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| libtorch-linux-xenial-cuda10.2-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk | ✅ triggered |
| libtorch-linux-xenial-cuda11.3-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-bionic-cuda10.2-py3.9-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk | ✅ triggered |
| linux-xenial-cuda11.3-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-cuda11.3-py3.7-gcc7-no-ops | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk | ✅ triggered |
| periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | ✅ triggered |
| periodic-linux-bionic-cuda11.5-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | ✅ triggered |
| periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck | ✅ triggered |
| periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | ✅ triggered |
| periodic-win-vs2019-cuda11.5-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | ✅ triggered |
| win-vs2019-cuda11.3-py3 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win | ✅ triggered |

Skipped Workflows

| Workflow | Labels | Status |
|----------|--------|--------|
| caffe2-linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| docker-builds | ciflow/all, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64 | ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled | 🚫 skipped |
| ios-12-5-1-arm64-coreml | ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled | 🚫 skipped |
| ios-12-5-1-arm64-custom-ops | ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled | 🚫 skipped |
| ios-12-5-1-arm64-metal | ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled | 🚫 skipped |
| ios-12-5-1-x86-64 | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-x86-64-coreml | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| linux-binary-conda | ciflow/binaries, ciflow/binaries_conda, ciflow/default | 🚫 skipped |
| linux-binary-libtorch-cxx11-abi | ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk | 🚫 skipped |
| linux-binary-libtorch-pre-cxx11 | ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk | 🚫 skipped |
| linux-binary-manywheel | ciflow/all, ciflow/binaries, ciflow/binaries_wheel, ciflow/default, ciflow/trunk | 🚫 skipped |
| linux-bionic-py3.7-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk | 🚫 skipped |
| linux-bionic-rocm4.5-py3.7 | ciflow/all, ciflow/default, ciflow/linux, ciflow/rocm, ciflow/trunk | 🚫 skipped |
| linux-bionic-rocm4.5-py3.7-distributed | ciflow/all, ciflow/linux, ciflow/rocm, ciflow/trunk | 🚫 skipped |
| linux-docs | ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk | 🚫 skipped |
| linux-docs-push | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| linux-vulkan-bionic-py3.7-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan | 🚫 skipped |
| linux-xenial-cuda11.3-py3.7-gcc7-bazel-test | ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | 🚫 skipped |
| linux-xenial-py3-clang5-mobile-build | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk | 🚫 skipped |
| linux-xenial-py3-clang5-mobile-custom-build-static | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk | 🚫 skipped |
| linux-xenial-py3.7-clang7-asan | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk | 🚫 skipped |
| linux-xenial-py3.7-clang7-onnx | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk | 🚫 skipped |
| linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | 🚫 skipped |
| linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build | ciflow/all, ciflow/cpu, ciflow/default, ciflow/libtorch, ciflow/linux, ciflow/mobile, ciflow/trunk | 🚫 skipped |
| linux-xenial-py3.7-gcc7 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | 🚫 skipped |
| linux-xenial-py3.7-gcc7-no-ops | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | 🚫 skipped |
| macos-10-15-py3-arm64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| macos-10-15-py3-lite-interpreter-x86-64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| macos-11-py3-x86-64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| macos-arm64-binary-conda | ciflow/binaries, ciflow/binaries_conda, ciflow/default | 🚫 skipped |
| macos-arm64-binary-wheel | ciflow/binaries, ciflow/binaries_wheel, ciflow/default | 🚫 skipped |
| macos-binary-conda | ciflow/binaries, ciflow/binaries_conda, ciflow/default | 🚫 skipped |
| macos-binary-libtorch-cxx11-abi | ciflow/binaries, ciflow/binaries_libtorch, ciflow/default | 🚫 skipped |
| macos-binary-libtorch-pre-cxx11 | ciflow/binaries, ciflow/binaries_libtorch, ciflow/default | 🚫 skipped |
| macos-binary-wheel | ciflow/binaries, ciflow/binaries_wheel, ciflow/default | 🚫 skipped |
| parallelnative-linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build | ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single | ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | 🚫 skipped |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit | ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | 🚫 skipped |
| pytorch-xla-linux-bionic-py3.7-clang8 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk, ciflow/xla | 🚫 skipped |
| win-vs2019-cpu-py3 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win | 🚫 skipped |
| windows-binary-conda | ciflow/binaries, ciflow/binaries_conda, ciflow/default | 🚫 skipped |
| windows-binary-libtorch-debug | ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk | 🚫 skipped |
| windows-binary-libtorch-release | ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk | 🚫 skipped |
| windows-binary-wheel | ciflow/all, ciflow/binaries, ciflow/binaries_wheel, ciflow/default, ciflow/trunk | 🚫 skipped |

@facebook-github-bot
Contributor

facebook-github-bot commented Mar 19, 2022

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 42b868c (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

```cpp
#endif  // AT_MAGMA_ENABLED()
  // On the ROCm platform we can only use MAGMA here.
  // If MAGMA is not available, an error will be thrown.
  gels_magma(a, b, infos);
```
Collaborator

Why change the behavior on ROCm?

Collaborator Author

It doesn't change it; it makes it explicit that the MAGMA function is called on ROCm. Previously the call chain was `this_function() -> gels_looped() -> gels_magma()`; now it's `this_function() -> gels_magma()`.

Collaborator

Cool, thanks for the explanation
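The dispatch behavior discussed in this thread can be sketched as a toy function (illustrative Python only, not PyTorch's actual C++ dispatch; the size threshold is a made-up assumption standing in for the real heuristic):

```python
def choose_gels_backend(is_rocm: bool, has_magma: bool,
                        batch_size: int, n: int) -> str:
    """Toy dispatcher mirroring the behavior discussed in this thread.

    NOT PyTorch's actual code. It only illustrates that ROCm goes
    straight to MAGMA, while CUDA builds can now pick the batched
    cuBLAS path for batches of small matrices.
    """
    if is_rocm:
        # On ROCm only MAGMA is usable; error out if it is missing.
        if not has_magma:
            raise RuntimeError("torch.linalg.lstsq on ROCm requires MAGMA")
        return "gels_magma"
    # Hypothetical threshold: prefer batched cuBLAS for many small matrices.
    if batch_size > 1 and n <= 256:
        return "gels_batched_cublas"  # the fast path enabled by this PR
    return "gels_cusolver"
```

The point of the refactor, as explained above, is that the ROCm branch now reaches `gels_magma` directly rather than through an intermediate `gels_looped` call.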

Collaborator

@mruberry left a comment

Cool! -- One question for you, @IvanYashchuk

@mruberry
Collaborator

@pytorchbot merge this please

@github-actions
Contributor

Hey @IvanYashchuk.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@IvanYashchuk added the `release notes: linalg_frontend` label on Mar 22, 2022
@IvanYashchuk deleted the lstsq-cublas-path branch March 22, 2022 08:47
facebook-github-bot pushed a commit that referenced this pull request Mar 23, 2022
…atrices (#74434)


Pull Request resolved: #74434
Approved by: https://github.com/mruberry

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/4de870f6040d2799acc11b9bdeb6508eb6fd33d9

Reviewed By: malfet

Differential Revision: D35047957

fbshipit-source-id: c21c3fdabcf2fc747089e5915fe66561760602c3
shahofblah pushed a commit that referenced this pull request Mar 25, 2022
…atrices

laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
…atrices


Labels

cla signed · module: cuda · module: linear algebra · open source · release notes: linalg_frontend · topic: performance

4 participants