Add cusolver to build, rewrite MAGMA inverse with cusolver #42403
Conversation
💊 CI failures summary and remediations

As of commit 1a0fe5a (more details on the Dr. CI page): ✅ None of the CI failures appear to be your fault 💚

🚧 1 fixed upstream failure: these were probably caused by upstream breakages that were already fixed.
vishwakftw
left a comment
I have some preliminary comments, I can take a pass at this PR once it's complete too.
IvanYashchuk
left a comment
Hello @xwang233, great start with rewriting MAGMA calls to cuSolver! I have some experience with cuSolver, so I thought I'd drop a few comments.
One small issue that is often overlooked is that the leading dimension is not necessarily the same as one of the dimensions of the matrix. The LDA parameter in BLAS and LAPACK is the stride of the matrix as it is laid out in contiguous memory.
For example, a zero-sized matrix is perfectly valid input, but the current implementation will fail with CUSOLVER_STATUS_INVALID_VALUE. For getrs this means: invalid parameters were passed (n < 0, lda < max(1, n), or ldb < max(1, n)).
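To make the leading-dimension point concrete, here is a small pure-Python sketch (illustrative only — the helper name and buffer are made up, not PyTorch or cuSolver code) of how an n×n matrix embedded in a taller column-major allocation is addressed with an LDA larger than n:

```python
def index_colmajor(i, j, lda):
    """Offset of element (i, j) in a column-major buffer with leading dimension lda."""
    return i + j * lda

# A 2x2 matrix stored inside a buffer allocated for 4 rows: lda = 4, not 2.
# In column-major layout, column 0 occupies offsets 0..3 and column 1 offsets 4..7,
# so consecutive columns are lda elements apart even though the matrix is only 2x2.
lda, n = 4, 2
buf = [0.0] * (lda * n)
matrix = [[1.0, 2.0],
          [3.0, 4.0]]
for j in range(n):
    for i in range(n):
        buf[index_colmajor(i, j, lda)] = matrix[i][j]

# Element (1, 0) lives at offset 1; element (0, 1) lives at offset lda = 4.
assert buf[index_colmajor(1, 0, lda)] == 3.0
assert buf[index_colmajor(0, 1, lda)] == 2.0
```

Passing `n` instead of the true stride as LDA in a call like getrs would silently read the wrong elements for any matrix that is not tightly packed.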
vishwakftw
left a comment
Some more minor comments.
@xwang233 If you can generate some benchmark numbers to support this transition, that would be great.

@vishwakftw Thanks for the review! I'll get benchmarks for cpu, gpu-before (magma), gpu-after (cusolver) soon when this PR is ready.
Ok, I see where the problem is. To avoid confusion, we usually name our functions differently from library functions. Also, the current code sometimes uses
ngimel
left a comment
I have a small question on the test, otherwise it is good to go. Thanks, great work!
```python
matrix_inverse_out = torch.empty(*batches, n, n, dtype=torch.float64, device=device)
torch.inverse(matrix, out=matrix_inverse_out)
self.assertEqual(matrix_inverse_out, matrix_inverse, atol=0, rtol=0)
# second call, now that matrix_inverse_out is transposed
```
I don't see matrix_inverse_out being transposed, at least prior to this PR. Do we need this second case?
Yes, you are right. I tried that, and the out=matrix_inverse_out tensor doesn't get transposed if matrix_inverse_out wasn't transposed before. I'll modify that and remove the second run.
facebook-github-bot
left a comment
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
```cpp
template <>
void getrfBatched<double>(
    int _m, int n, double** dA_array, int ldda, int* ipiv_array, int* info_array, int batchsize) {
```
Sorry, I did not notice this earlier, but the linter complains about the unused param `_m` here (CUDABLAS_GETRF_ARGTYPES probably also needs to be adjusted?).
facebook-github-bot
left a comment
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
…-stream issue (#47026)

Summary:

### test_inverse_singular for cublas failure

Related #46616 (comment): https://app.circleci.com/pipelines/github/pytorch/pytorch/232112/workflows/4131d4ca-cd51-44e3-8e6c-b1c3555c62fa/jobs/8523970/tests

The cuda 11.1 CI container doesn't have the MAGMA library, so the cublas matrix inverse path is enabled.

```
Oct 27 23:13:47 -- MAGMA not found. Compiling without MAGMA support
```

test_inverse_singular was introduced in #46625, but I forgot to fix that functionality for the cublas path as well.

### cusolver inverse multi-stream failure

Fixes #47272. The original cuda event record/block-stream logic was wrong, which could cause NaN in the output tensor. On my machine, the original code observes NaN within about 50k–500k loops. After this change, no NaN is observed in more than 2.5M loops. The performance of batch-2 matrix inverse is still the same as in #42403.

Pull Request resolved: #47026
Reviewed By: mruberry
Differential Revision: D24838546
Pulled By: ngimel
fbshipit-source-id: 3b83e4ab8e6b47a8273cba277251765bd6d97911
…ytorch#42403)" This reverts commit d75c402.
…2403)

Summary:

Fixes pytorch#42265

This PR adds cusolver to the pytorch build, and enables the use of cusolver/cublas library functions for GPU `torch.inverse` on certain tensor shapes. Specifically, cusolver/cublas will be used when:

* the tensor is two dimensional (single batch), or
* the tensor has >2 dimensions (multiple batches) and `batch_size <= 2`, or
* magma is not linked.

In all other conditions, the current MAGMA implementation will still be used.

https://github.com/pytorch/pytorch/blob/8c0949ae454b1d2c1b626a5ea19ba5ea6487d305/aten/src/ATen/native/cuda/BatchLinearAlgebra.cu#L742-L752

The reason for this is that for tensors with large batch_size, `cublasXgetrfBatched` and `cublasXgetriBatched` don't perform very well. For `batch_size > 1`, we launch cusolver functions in multiple streams. This lets the cusolver functions run in parallel and can greatly increase performance. When `batch_size > 2`, the parallel-launched cusolver functions are slightly slower than the current magma implementation, so we still use the current magma impl.

On CUDA 9.2, some numerical issues were detected, so the cusolver impl will not be used there. The cusolver impl will also not be used on platforms other than Nvidia CUDA.

https://github.com/pytorch/pytorch/blob/060769feaf02db56ac79e0c73dab1105828ece69/aten/src/ATen/native/cuda/BatchLinearAlgebraLib.h#L10-L13

Note that there is a new heuristic used before the cusolver/cublas calls here:

https://github.com/pytorch/pytorch/blob/8c0949ae454b1d2c1b626a5ea19ba5ea6487d305/aten/src/ATen/native/cuda/MiscUtils.h#L113-L121

where `use_loop_launch = true` means launch single-batch cusolver functions in parallel, and `use_loop_launch = false` means use the cublas_X_batched functions. When magma is enabled (only `batch_size <= 2` is dispatched to cusolver/cublas), the heuristic always returns `true`, and the cusolver calls are faster than small-batch_size magma calls.

When magma is disabled, this adds the functionality of `torch.inverse`, which was previously disabled for all shapes (though large-batch_size cublas performance may not be as good as magma's).

Checklist:

- [x] Add benchmark: cpu, gpu-before (magma), gpu-after (cusolver)
- [x] Rewrite single inverse (ndim == 2) with cusolver
- [x] Rewrite batched inverse (ndim > 2) with cublas
- [x] Add cusolver to build
- [x] Clean up functions related to the `USE_MAGMA` define guard
- [x] Workaround for non-cuda platforms
- [x] Workaround for cuda 9.2
- [x] Add zero size check
- [x] Add tests

Next step: if cusolver doesn't cause any problems in the pytorch build, and there are no major performance regressions reported after this PR is merged, I will start porting other cusolver/cublas functions for linear algebra to improve performance.

<details>
<summary> benchmark 73499c6 </summary>

benchmark code: https://github.com/xwang233/code-snippet/blob/master/torch.inverse/inverse-cusolver.ipynb

shape meaning:

* `[] 2 torch.float32 -> torch.randn(2, 2, dtype=torch.float32)`
* `[2] 4 torch.float32 -> torch.randn(2, 4, 4, dtype=torch.float32)`

| shape | cpu_time (ms) | gpu_time_before (magma) (ms) | gpu_time_after (ms) |
| --- | --- | --- | --- |
| [] 2 torch.float32 | 0.095 | 7.534 | 0.129 |
| [] 4 torch.float32 | 0.009 | 7.522 | 0.129 |
| [] 8 torch.float32 | 0.011 | 7.647 | 0.138 |
| [] 16 torch.float32 | 0.075 | 7.582 | 0.135 |
| [] 32 torch.float32 | 0.073 | 7.573 | 0.191 |
| [] 64 torch.float32 | 0.134 | 7.694 | 0.288 |
| [] 128 torch.float32 | 0.398 | 8.073 | 0.491 |
| [] 256 torch.float32 | 1.054 | 11.860 | 1.074 |
| [] 512 torch.float32 | 5.218 | 14.130 | 2.582 |
| [] 1024 torch.float32 | 19.010 | 18.780 | 6.936 |
| [1] 2 torch.float32 | 0.009 | 0.113 | 0.128 ***regressed |
| [1] 4 torch.float32 | 0.009 | 0.113 | 0.131 ***regressed |
| [1] 8 torch.float32 | 0.011 | 0.116 | 0.129 ***regressed |
| [1] 16 torch.float32 | 0.015 | 0.122 | 0.135 ***regressed |
| [1] 32 torch.float32 | 0.032 | 0.177 | 0.178 ***regressed |
| [1] 64 torch.float32 | 0.070 | 0.420 | 0.281 |
| [1] 128 torch.float32 | 0.328 | 0.816 | 0.490 |
| [1] 256 torch.float32 | 1.125 | 1.690 | 1.084 |
| [1] 512 torch.float32 | 4.344 | 4.305 | 2.576 |
| [1] 1024 torch.float32 | 16.510 | 16.340 | 6.928 |
| [2] 2 torch.float32 | 0.009 | 0.113 | 0.186 ***regressed |
| [2] 4 torch.float32 | 0.011 | 0.115 | 0.184 ***regressed |
| [2] 8 torch.float32 | 0.012 | 0.114 | 0.184 ***regressed |
| [2] 16 torch.float32 | 0.019 | 0.119 | 0.173 ***regressed |
| [2] 32 torch.float32 | 0.050 | 0.170 | 0.240 ***regressed |
| [2] 64 torch.float32 | 0.120 | 0.429 | 0.375 |
| [2] 128 torch.float32 | 0.576 | 0.830 | 0.675 |
| [2] 256 torch.float32 | 2.021 | 1.748 | 1.451 |
| [2] 512 torch.float32 | 9.070 | 4.749 | 3.539 |
| [2] 1024 torch.float32 | 33.655 | 18.240 | 12.220 |
| [4] 2 torch.float32 | 0.009 | 0.112 | 0.318 ***regressed |
| [4] 4 torch.float32 | 0.010 | 0.115 | 0.319 ***regressed |
| [4] 8 torch.float32 | 0.013 | 0.115 | 0.320 ***regressed |
| [4] 16 torch.float32 | 0.027 | 0.120 | 0.331 ***regressed |
| [4] 32 torch.float32 | 0.085 | 0.173 | 0.385 ***regressed |
| [4] 64 torch.float32 | 0.221 | 0.431 | 0.646 ***regressed |
| [4] 128 torch.float32 | 1.102 | 0.834 | 1.055 ***regressed |
| [4] 256 torch.float32 | 4.042 | 1.811 | 2.054 ***regressed |
| [4] 512 torch.float32 | 18.390 | 4.884 | 5.087 ***regressed |
| [4] 1024 torch.float32 | 69.025 | 19.840 | 20.000 ***regressed |

</details>

Pull Request resolved: pytorch#42403
Reviewed By: ailzhang, mruberry
Differential Revision: D23717984
Pulled By: ngimel
fbshipit-source-id: 54cbd9ea72a97989cff4127089938e8a8e29a72b
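As a rough sketch of the dispatch logic described in the summary (illustrative only — the function names and return values here are made up; the real heuristic lives in MiscUtils.h and BatchLinearAlgebra.cu):

```python
def pick_backend(ndim, batch_size, magma_available):
    """Illustrative: which implementation handles torch.inverse for an input.

    Per the PR description: 2-D tensors, batches of size <= 2, and builds
    without MAGMA take the cusolver/cublas path; everything else stays on MAGMA.
    """
    if ndim == 2 or batch_size <= 2 or not magma_available:
        return "cusolver/cublas"
    return "magma"

def use_loop_launch(batch_size):
    """Illustrative: True -> launch one cusolver call per matrix on separate
    CUDA streams so they run in parallel; False -> call the cublas *Batched
    functions. The cutoff here is a hypothetical stand-in for the heuristic;
    with MAGMA linked, only batch_size <= 2 reaches this point at all."""
    return batch_size <= 2

assert pick_backend(ndim=2, batch_size=1, magma_available=True) == "cusolver/cublas"
assert pick_backend(ndim=3, batch_size=8, magma_available=True) == "magma"
assert pick_backend(ndim=3, batch_size=8, magma_available=False) == "cusolver/cublas"
```

This mirrors the benchmark results above: the loop-launch cusolver path wins for small batches, while large batches remain faster on MAGMA than on the cublas batched kernels.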