
Enable faster cuBLAS path for torch.linalg.lstsq for batch of small matrices#74434

Closed

IvanYashchuk wants to merge 2 commits into `pytorch:master` from `IvanYashchuk:lstsq-cublas-path`

Conversation

@IvanYashchuk
Collaborator

This PR enables the cuBLAS path for `torch.linalg.lstsq`. Before this PR, only the cuSOLVER path was used for regular PyTorch builds (those built with MAGMA).

Performance results (also previously reported at #54725 (comment)):

|                            | before current PR | current PR | speedup |
|----------------------------|-------------------|------------|---------|
| torch.Size([32, 32, 32])   | 870               | 440        | 2x      |
| torch.Size([64, 32, 32])   | 1340              | 450        | 3x      |
| torch.Size([32, 64, 64])   | 9040              | 1839       | 5x      |
| torch.Size([64, 64, 64])   | 17000             | 1830       | 9.2x    |
| torch.Size([32, 128, 128]) | 23210             | 8560       | 2.7x    |
| torch.Size([64, 128, 128]) | 40000             | 8662       | 4.6x    |
| torch.Size([32, 256, 256]) | 58160             | 46150      | 1.2x    |
| torch.Size([64, 256, 256]) | 73220             | 52080      | 1.4x    |
Times are in microseconds (us).
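For context, the per-batch computation that `torch.linalg.lstsq` performs can be reproduced with a NumPy stand-in (a minimal sketch, assuming NumPy is available; all variable names are illustrative, and the shapes mirror the `torch.Size([batch, n, n])` cases benchmarked above — NumPy's `lstsq` is not batched, so we loop):

```python
import numpy as np

# Illustrative stand-in for what torch.linalg.lstsq computes on a batch
# of small matrices: an independent least-squares solve per batch
# element, min_x || A[i] @ x - b[i] ||_2.
rng = np.random.default_rng(0)
batch, m, n = 4, 32, 32
A = rng.standard_normal((batch, m, n))
b = rng.standard_normal((batch, m, 1))

# NumPy's lstsq is not batched, so solve each system in a loop.
x = np.stack([np.linalg.lstsq(A[i], b[i], rcond=None)[0] for i in range(batch)])

# For square, well-conditioned A the solution satisfies A @ x ≈ b.
residual = np.linalg.norm(A @ x - b)
print(residual)  # tiny (machine-precision scale) for these square systems
```

A batched GPU backend amortizes kernel-launch overhead across exactly this kind of loop, which is consistent with the larger speedups at batch size 64 in the table above.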

@IvanYashchuk added the `module: cuda`, `module: linear algebra`, and `topic: performance` labels on Mar 19, 2022
@IvanYashchuk requested a review from mruberry March 19, 2022 18:04
@pytorch-bot

pytorch-bot Bot commented Mar 19, 2022

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/IvanYashchuk/pytorch/blob/42b868c2bda62af6a5ee71c8c24e9584f2e00565/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/cuda
Add ciflow labels to this PR to trigger more builds:

Triggered Workflows

| Workflow | Labels (bold = enabled) | Status |
|----------|-------------------------|--------|
| deploy-linux-xenial-cuda11.3-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| libtorch-linux-xenial-cuda10.2-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk | ✅ triggered |
| libtorch-linux-xenial-cuda11.3-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-bionic-cuda10.2-py3.9-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk | ✅ triggered |
| linux-xenial-cuda11.3-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-cuda11.3-py3.7-gcc7-no-ops | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk | ✅ triggered |
| periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | ✅ triggered |
| periodic-linux-bionic-cuda11.5-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | ✅ triggered |
| periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck | ✅ triggered |
| periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | ✅ triggered |
| periodic-win-vs2019-cuda11.5-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | ✅ triggered |
| win-vs2019-cuda11.3-py3 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win | ✅ triggered |

Skipped Workflows

| Workflow | Labels | Status |
|----------|--------|--------|
| caffe2-linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| docker-builds | ciflow/all, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64 | ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled | 🚫 skipped |
| ios-12-5-1-arm64-coreml | ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled | 🚫 skipped |
| ios-12-5-1-arm64-custom-ops | ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled | 🚫 skipped |
| ios-12-5-1-arm64-metal | ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled | 🚫 skipped |
| ios-12-5-1-x86-64 | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-x86-64-coreml | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| linux-binary-conda | ciflow/binaries, ciflow/binaries_conda, ciflow/default | 🚫 skipped |
| linux-binary-libtorch-cxx11-abi | ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk | 🚫 skipped |
| linux-binary-libtorch-pre-cxx11 | ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk | 🚫 skipped |
| linux-binary-manywheel | ciflow/all, ciflow/binaries, ciflow/binaries_wheel, ciflow/default, ciflow/trunk | 🚫 skipped |
| linux-bionic-py3.7-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk | 🚫 skipped |
| linux-bionic-rocm4.5-py3.7 | ciflow/all, ciflow/default, ciflow/linux, ciflow/rocm, ciflow/trunk | 🚫 skipped |
| linux-bionic-rocm4.5-py3.7-distributed | ciflow/all, ciflow/linux, ciflow/rocm, ciflow/trunk | 🚫 skipped |
| linux-docs | ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk | 🚫 skipped |
| linux-docs-push | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| linux-vulkan-bionic-py3.7-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan | 🚫 skipped |
| linux-xenial-cuda11.3-py3.7-gcc7-bazel-test | ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | 🚫 skipped |
| linux-xenial-py3-clang5-mobile-build | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk | 🚫 skipped |
| linux-xenial-py3-clang5-mobile-custom-build-static | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk | 🚫 skipped |
| linux-xenial-py3.7-clang7-asan | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk | 🚫 skipped |
| linux-xenial-py3.7-clang7-onnx | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk | 🚫 skipped |
| linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | 🚫 skipped |
| linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build | ciflow/all, ciflow/cpu, ciflow/default, ciflow/libtorch, ciflow/linux, ciflow/mobile, ciflow/trunk | 🚫 skipped |
| linux-xenial-py3.7-gcc7 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | 🚫 skipped |
| linux-xenial-py3.7-gcc7-no-ops | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | 🚫 skipped |
| macos-10-15-py3-arm64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| macos-10-15-py3-lite-interpreter-x86-64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| macos-11-py3-x86-64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| macos-arm64-binary-conda | ciflow/binaries, ciflow/binaries_conda, ciflow/default | 🚫 skipped |
| macos-arm64-binary-wheel | ciflow/binaries, ciflow/binaries_wheel, ciflow/default | 🚫 skipped |
| macos-binary-conda | ciflow/binaries, ciflow/binaries_conda, ciflow/default | 🚫 skipped |
| macos-binary-libtorch-cxx11-abi | ciflow/binaries, ciflow/binaries_libtorch, ciflow/default | 🚫 skipped |
| macos-binary-libtorch-pre-cxx11 | ciflow/binaries, ciflow/binaries_libtorch, ciflow/default | 🚫 skipped |
| macos-binary-wheel | ciflow/binaries, ciflow/binaries_wheel, ciflow/default | 🚫 skipped |
| parallelnative-linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build | ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single | ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | 🚫 skipped |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit | ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | 🚫 skipped |
| pytorch-xla-linux-bionic-py3.7-clang8 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk, ciflow/xla | 🚫 skipped |
| win-vs2019-cpu-py3 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win | 🚫 skipped |
| windows-binary-conda | ciflow/binaries, ciflow/binaries_conda, ciflow/default | 🚫 skipped |
| windows-binary-libtorch-debug | ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk | 🚫 skipped |
| windows-binary-libtorch-release | ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk | 🚫 skipped |
| windows-binary-wheel | ciflow/all, ciflow/binaries, ciflow/binaries_wheel, ciflow/default, ciflow/trunk | 🚫 skipped |

@facebook-github-bot
Contributor

facebook-github-bot commented Mar 19, 2022

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 42b868c (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

```cpp
#endif  // AT_MAGMA_ENABLED()
  // On the ROCm platform we can only use MAGMA here.
  // If MAGMA is not available, an error will be thrown.
  gels_magma(a, b, infos);
```
Collaborator

Why change the behavior on ROCm?

Collaborator Author

It doesn't change it; it makes it explicit that the MAGMA function is called on ROCm. Previously the call chain was `this_function() -> gels_looped() -> gels_magma()`; now it's `this_function() -> gels_magma()`.

Collaborator

Cool, thanks for the explanation
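The dispatch behavior discussed in this thread can be sketched as a toy function (illustrative Python only, not PyTorch's actual C++ dispatch; the size threshold is a made-up assumption standing in for the real heuristic):

```python
def choose_gels_backend(is_rocm: bool, has_magma: bool,
                        batch_size: int, n: int) -> str:
    """Toy dispatcher mirroring the behavior discussed in this thread.

    NOT PyTorch's actual code. It only illustrates that ROCm goes
    straight to MAGMA, while CUDA builds can now pick the batched
    cuBLAS path for batches of small matrices.
    """
    if is_rocm:
        # On ROCm only MAGMA is usable; error out if it is missing.
        if not has_magma:
            raise RuntimeError("torch.linalg.lstsq on ROCm requires MAGMA")
        return "gels_magma"
    # Hypothetical threshold: prefer batched cuBLAS for many small matrices.
    if batch_size > 1 and n <= 256:
        return "gels_batched_cublas"  # the fast path enabled by this PR
    return "gels_cusolver"
```

The point of the refactor, as explained above, is that the ROCm branch now reaches `gels_magma` directly rather than through an intermediate `gels_looped` call.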

Collaborator

@mruberry left a comment

Cool! -- One question for you, @IvanYashchuk

@mruberry
Collaborator

@pytorchbot merge this please

@github-actions
Contributor

Hey @IvanYashchuk.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@IvanYashchuk added the `release notes: linalg_frontend` label on Mar 22, 2022
@IvanYashchuk deleted the lstsq-cublas-path branch March 22, 2022 08:47
facebook-github-bot pushed a commit that referenced this pull request Mar 23, 2022
…atrices (#74434)


Pull Request resolved: #74434
Approved by: https://github.com/mruberry

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/4de870f6040d2799acc11b9bdeb6508eb6fd33d9

Reviewed By: malfet

Differential Revision: D35047957

fbshipit-source-id: c21c3fdabcf2fc747089e5915fe66561760602c3
shahofblah pushed a commit that referenced this pull request Mar 25, 2022
…atrices

laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
…atrices


Labels

cla signed · module: cuda · module: linear algebra · open source · release notes: linalg_frontend · topic: performance

4 participants