cuSOLVER path for LU factorization in CUDA #56887
Conversation
💊 CI failures summary and remediations

As of commit 661be93 (more details on the Dr. CI page and at hud.pytorch.org/pr/56887):

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:
1 job timed out: `pytorch_linux_xenial_cuda11_3_cudnn8_py3_gcc7_test`
🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed. Please rebase on the viable/strict branch. If your commit is older than viable/strict, run these commands:

```
git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD
```

- pytorch_xla_linux_bionic_py3_6_clang9_test from Jun 23 until Jun 24 (80f40b1 - b7298f4)
Preview docs built from this PR
Codecov Report

```diff
@@            Coverage Diff            @@
##           master   #56887       +/- ##
=========================================
+ Coverage   50.49%   77.68%    +27.19%
=========================================
  Files         596     1954      +1358
  Lines       76023   194549    +118526
=========================================
+ Hits        38387   151134    +112747
- Misses      37636    43415      +5779
```
@IvanYashchuk I'm leaving the 64-bit cuSOLVER API out of this PR, since it malfunctions for singular matrices (it gives different results than MAGMA/scipy), and the pivot arrays used by other methods such as lu_solve need to be 32-bit for now. I will try again after adding 64-bit functionality for lu_solve. I'll use ghstack.
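The coupling described above (lu_solve consumes the pivot array produced by the factorization, so both sides must agree on the pivot format) can be illustrated with a minimal pure-Python LU with partial pivoting. This is a hedged sketch for intuition only, not the ATen implementation:

```python
def lu_factor(a):
    """In-place LU with partial pivoting on a list-of-lists matrix.

    Returns (a, piv), where piv[k] is the row swapped with row k at
    step k (0-based here; LAPACK-style getrf uses 1-based ipiv).
    """
    n = len(a)
    piv = []
    for k in range(n):
        # Choose the pivot row with the largest absolute value in column k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        piv.append(p)
        a[k], a[p] = a[p], a[k]
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return a, piv

def lu_solve(lu, piv, b):
    """Solve A x = b given (lu, piv). It consumes the exact pivot array
    lu_factor produced, which is why their formats must match."""
    n = len(lu)
    b = list(b)
    for k in range(n):               # apply the recorded row swaps to b
        b[k], b[piv[k]] = b[piv[k]], b[k]
    for i in range(n):               # forward substitution with unit-diagonal L
        for j in range(i):
            b[i] -= lu[i][j] * b[j]
    for i in reversed(range(n)):     # back substitution with U
        for j in range(i + 1, n):
            b[i] -= lu[i][j] * b[j]
        b[i] /= lu[i][i]
    return b
```

A factorization that records pivots in one format (say, a 64-bit layout) while lu_solve expects another would scramble the row swaps, which is the failure mode motivating the 32-bit restriction here.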
Sure, the 64-bit functions are not critical to use now; we can add them later, after the 1.9 release. In my experience there is no performance difference between the 32-bit and 64-bit variants for other functions, and the 32-bit functions are not even deprecated yet, so no need to worry about it for a while. Could you please add this to the issue where we track cuSOLVER & cuBLAS bugs, #53879.
Force-pushed from 38b2d32 to 7eebf16
- start updating cusolver LU solve functions
- update cuSOLVER routine
- convert nan to 0
- updating kernel to work with complex dtypes
- add 64bit cusolver API
- fix minor problems in CUDA solver 64 bit API
- remove 64 bit usage of cusolver
- add 64 bit CUSOLVER interface
- Add cuSOLVER 64 bit API for single batched LU factorization.
- Add cuSOLVER 64 bit API for single batched LU factorization.
- remove 64 bit API due to wrong pivot arrays being returned for singular matrix and breaking other routines
- Add cuSOLVER path to single batch LU factorization.
- update cuSOLVER routine
- resolve cherry pick
- resolve cherry pick conflict
- update files for cherry pick
- update commits
- remove 64 bit usage of cusolver
- add 64 bit CUSOLVER interface
- remove 64 bit API due to wrong pivot arrays being returned for singular matrix and breaking other routines
Force-pushed from 7eebf16 to 3f9acc9
@v0dro can you please add benchmarks?
The results are in the pdf linked in this comment: #53879 (comment).
Thanks Ivan. That's 8 figures with no code to produce them. It would be really useful to put the figures, plus the code to produce them, in the PR description here. That's helpful not only for review but also for future follow-up: if a regression is detected post-merge, no one will be able to find a pdf attached to a comment on an issue that isn't linked to this PR.
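As a sketch of the kind of reproducible harness being requested here, a minimal timing helper might look like the following. The names are hypothetical and this is not the code used to produce the pdf figures:

```python
import time

def time_call(fn, *args, repeat=5, warmup=2):
    """Return the best wall-clock time in seconds over `repeat` runs,
    after `warmup` untimed runs to amortize one-time setup costs."""
    for _ in range(warmup):
        fn(*args)
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical usage, comparing two LU backends on the same input:
#   t_magma    = time_call(lu_factor_magma, a)
#   t_cusolver = time_call(lu_factor_cusolver, a)
```

For GPU kernels the timed callable would also need to synchronize the device before and after the timed region, otherwise only the asynchronous launch is measured.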
IvanYashchuk
left a comment
@v0dro, thank you for this pull request and for working on comparing the MAGMA/cuBLAS/cuSOLVER implementations.
I left a few suggestions inline.
As Ralf mentioned, could you please add a comment with the code used to produce the performance results, and link it from the main description of the PR.
Could you please also edit the PR description with the chosen conditions for when cuSOLVER is used and when batched MAGMA is used.
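For reference, the dispatch conditions that ended up in the final PR summary can be mirrored as a small predicate. This is a sketch of the heuristic only; the actual check lives in the C++ dispatch code:

```python
def use_cusolver_for_lu(batch_size: int, n: int) -> bool:
    """Mirror of the heuristic from the PR summary: cuSOLVER for a
    single matrix, or for small batches of tiny matrices; batched
    MAGMA for everything else."""
    return batch_size == 1 or (batch_size <= 8 and n <= 16)

assert use_cusolver_for_lu(1, 1024)      # single matrix: cuSOLVER
assert use_cusolver_for_lu(8, 16)        # small batch of tiny matrices
assert not use_cusolver_for_lu(16, 16)   # batch too large: MAGMA
assert not use_cusolver_for_lu(8, 32)    # matrices too large: MAGMA
```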
```cpp
}

void apply_lu_cusolver_looped(Tensor& self, Tensor& pivots, Tensor& infos, bool get_pivots) {
  auto infos_data = infos.data_ptr<int>();
```
Can we move this inside the lambda function?
This makes it more DRY and readable, IMO.
xwang233
left a comment
Thanks for the work on cusolver, @v0dro! I've left some comments on the PR.
Would it be convenient to post a benchmark table comparing the run times of magma, cusolver, and cublas (if needed) for getrf? With that, we can refer to it in the future. Also, I can report this to the cuSOLVER team and let them check and improve the performance to benefit us all. Thank you!
```cpp
self_data + batch * self_stride,
lda,
pivots_data + batch * pivots_matrix_stride,
infos_data + batch
```
If both pivots_tensor and infos_tensor are of shape req_size, do they have the same stride, which is 1 here?
pivots_tensor has size N (the number of matrix rows) and infos_tensor has size batch_size; each LU factorization returns a single info. If you perform an LU of a scalar, then yes, they will both have size 1.
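The looped strategy under discussion (one getrf-style call per batch element, each writing N pivots and one info) can be sketched shape-wise in Python; `factor_one` is a hypothetical stand-in for the per-matrix cuSOLVER call:

```python
def lu_looped(batch, n, factor_one):
    """For `batch` matrices of size n x n, collect one pivot array of
    length n and one info value per matrix, mirroring pivots of shape
    (batch, n) and infos of shape (batch,)."""
    pivots = []
    infos = []
    for b in range(batch):
        piv, info = factor_one(b)   # one factorization call per matrix
        pivots.append(piv)
        infos.append(info)
    return pivots, infos

# Trivial stand-in factorization: identity pivots, success info = 0.
pivots, infos = lu_looped(4, 3, lambda b: (list(range(3)), 0))
assert len(pivots) == 4 and all(len(p) == 3 for p in pivots)
assert len(infos) == 4
```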
Thank you for your comments. I will get back to you soon with a review.
Force-pushed from 13859d7 to 2401be7
@jithunnair-amd can you check now?
Force-pushed from 21ba864 to 3355359
Ping @jithunnair-amd -- is your issue resolved?

@jithunnair-amd ping!
I'm getting this failure on XLA: Might this be related to the issue that AMD is facing? |
lezcano
left a comment
Arrived late to the party, but just wanted to say that this LGTM!
```cpp
[&self,
 &pivots,
 &infos,
 &get_pivots]() {
```
Suggested change:
```diff
-  &get_pivots]() {
+  get_pivots]() {
```
nit: given that we are doing explicit capture, it is better to pass a bool by value.
The XLA failure is not related to your PR, and is probably already fixed on master.
@mruberry, could you please import this PR?

@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
mruberry
left a comment
Cool! Thanks @v0dro! And thanks @lezcano and @IvanYashchuk for reviewing.
Summary: This PR adds a cuSOLVER path for `torch.lu`.

Performance comparison results: pytorch#53879 (comment)
Code for reproducing the performance results: pytorch#56887 (comment)

The following heuristics are used for choosing cuSOLVER over MAGMA:
* If batch size == 1, OR (batch size <= 8 AND shape <= 16), choose cuSOLVER over MAGMA.
* For all other cases use MAGMA.

See also pytorch#47953.

The following performance results compare the MASTER branch with the current changes:

<details>

```
[-------------------------- LU factorization (ATen) torch.float64 ---------------------------]
                                  |  lu_factorize CURRENT  |  lu_factorize MASTER
1 threads: -----------------------------------------------------------------------------------
      torch.Size([1, 1, 1])       |         363.9          |        284.1
      torch.Size([2, 1, 1])       |         354.8          |        271.8
      torch.Size([4, 1, 1])       |         393.7          |        278.0
      torch.Size([8, 1, 1])       |         459.3          |        279.1
      torch.Size([16, 1, 1])      |         524.2          |        288.9
      torch.Size([32, 1, 1])      |         525.1          |        281.2
      torch.Size([64, 1, 1])      |         524.5          |        281.7
      torch.Size([128, 1, 1])     |         522.8          |        285.2
      torch.Size([1, 2, 2])       |         360.4          |        277.7
      torch.Size([2, 2, 2])       |         372.9          |        279.2
      torch.Size([4, 2, 2])       |         419.4          |        278.3
      torch.Size([8, 2, 2])       |         475.7          |        279.2
      torch.Size([16, 2, 2])      |         530.0          |        299.5
      torch.Size([32, 2, 2])      |         530.0          |        294.5
      torch.Size([64, 2, 2])      |         531.0          |        291.5
      torch.Size([128, 2, 2])     |         544.4          |        292.3
      torch.Size([1, 8, 8])       |         372.6          |        292.8
      torch.Size([2, 8, 8])       |         380.9          |        296.2
      torch.Size([4, 8, 8])       |         420.0          |        293.4
      torch.Size([8, 8, 8])       |         490.6          |        294.6
      torch.Size([16, 8, 8])      |         535.6          |        296.5
      torch.Size([32, 8, 8])      |         534.7          |        302.1
      torch.Size([64, 8, 8])      |         539.1          |        305.5
      torch.Size([128, 8, 8])     |         540.7          |        296.5
      torch.Size([1, 16, 16])     |         345.0          |        303.2
      torch.Size([2, 16, 16])     |         405.0          |        306.3
      torch.Size([4, 16, 16])     |         482.8          |        305.6
      torch.Size([8, 16, 16])     |         596.3          |        305.9
      torch.Size([16, 16, 16])    |         539.6          |        304.4
      torch.Size([32, 16, 16])    |         542.2          |        305.8
      torch.Size([64, 16, 16])    |         556.1          |        311.0
      torch.Size([128, 16, 16])   |         545.1          |        308.1
      torch.Size([1, 32, 32])     |         432.7          |        342.4
      torch.Size([2, 32, 32])     |         582.6          |        341.8
      torch.Size([4, 32, 32])     |         580.4          |        344.4
      torch.Size([8, 32, 32])     |         586.5          |        343.8
      torch.Size([16, 32, 32])    |         582.9          |        346.0
      torch.Size([32, 32, 32])    |         574.4          |        343.7
      torch.Size([64, 32, 32])    |         562.8          |        350.8
      torch.Size([128, 32, 32])   |         568.3          |        349.8
      torch.Size([1, 64, 64])     |         537.1          |        518.4
      torch.Size([2, 64, 64])     |         766.5          |        539.1
      torch.Size([4, 64, 64])     |         771.6          |        551.9
      torch.Size([8, 64, 64])     |         783.4          |        556.0
      torch.Size([16, 64, 64])    |         798.8          |        555.3
      torch.Size([32, 64, 64])    |         795.6          |        548.6
      torch.Size([64, 64, 64])    |         804.2          |        580.4
      torch.Size([128, 64, 64])   |         837.6          |        616.9
      torch.Size([1, 128, 128])   |         844.7          |        848.9
      torch.Size([2, 128, 128])   |        1096.7          |        873.3
      torch.Size([4, 128, 128])   |        1117.9          |        884.8
      torch.Size([8, 128, 128])   |        1138.1          |        903.6
      torch.Size([16, 128, 128])  |        1169.1          |        943.9
      torch.Size([32, 128, 128])  |        1204.8          |        981.4
      torch.Size([64, 128, 128])  |        1336.6          |       1105.8
      torch.Size([128, 128, 128]) |        1639.4          |       1473.3
      torch.Size([1, 512, 512])   |        3714.3          |       3928.6
      torch.Size([2, 512, 512])   |        4388.3          |       4179.7
      torch.Size([4, 512, 512])   |        4765.4          |       4536.9
      torch.Size([8, 512, 512])   |        5615.2          |       5441.1
      torch.Size([16, 512, 512])  |        7203.6          |       7130.2
      torch.Size([32, 512, 512])  |       10580.5          |      10503.9
      torch.Size([64, 512, 512])  |       17374.8          |      17349.6
      torch.Size([128, 512, 512]) |       32542.3          |      32548.8
      torch.Size([1, 1024, 1024]) |       10041.5          |      14292.3
      torch.Size([2, 1024, 1024]) |       17126.6          |      16971.0
      torch.Size([4, 1024, 1024]) |       20591.0          |      20490.8
      torch.Size([8, 1024, 1024]) |       27682.8          |      27560.7
      torch.Size([16, 1024, 1024])|       41035.2          |      41035.8
      torch.Size([32, 1024, 1024])|       67091.8          |      67345.9
      torch.Size([64, 1024, 1024])|      119612.3          |     119782.3
      torch.Size([128, 1024, 1024])|     230095.5          |     230766.2

Times are in microseconds (us).
```

</details>

The main reason a performance regression can be seen is related to issue pytorch#55122, and there seems to be no easy way to fix it (at least in this PR).

Pull Request resolved: pytorch#56887
Reviewed By: ngimel
Differential Revision: D29482342
Pulled By: mruberry
fbshipit-source-id: 4fdedf21b0d5597b289e168dff61d5f5d7727fb1