Add cusolver potrf and potrfBatched to the backend of torch.cholesky decomposition #53104
xwang233 wants to merge 29 commits into pytorch:master
Conversation
Benchmark
See https://github.com/xwang233/code-snippet/tree/master/linalg/cholesky for A100, V100 (pending).
before: master commit 1.9.0a0+git68b6249
after: commit 1.9.0a0+gitccb6d48
This page was tested on E5-2680 v3 and RTX 2070 Super. Libraries: intel-mkl 2020.2.254-1 and cuda 11.2.
time is in us (10^-6 s)
(benchmark table not preserved here; see the linked repository for the full results)
💊 CI failures summary and remediations: as of commit d52f95d (more details on the Dr. CI page): 💚 Looks good so far! There are no failures yet. 💚 (This comment was automatically generated by Dr. CI.)
if (self.numel() == 0) {
  return at::empty_like(self, LEGACY_CONTIGUOUS_MEMORY_FORMAT);
}
Could you please remove this? cuSOLVER calls should handle this case correctly if the lda parameter is right. We don't need to add this early return ourselves.
A zero matrix stride or a zero batch size causes problems in the arange call below. Writing an if statement there would be no different from writing one here.
Besides that, if we can shortcut the functionality at a much earlier place, why do we want to rely on the much deeper implementations of LAPACK, cusolver, or MAGMA for zero batch/matrix sizes? We don't even have control over those implementations, and they may burn us in the future.
A matrix of size 0x0 is not a special case and shouldn't be treated as such; I don't think 0xNxN should be treated as special either.
> why do we want to rely on the much deeper implementations of LAPACK, cusolver, or MAGMA for zero batch/matrix sizes? We don't even have control over those implementations, and they may burn us in the future.
The LAPACK API is a de facto standard and I doubt that vendor implementations would deviate from reference LAPACK. Existing calls to LAPACK and MAGMA work without this early return, and cuSOLVER's potrf will work as well. It's better to keep this kind of thing closer to the source of the problem and leave a comment explaining why it needs to be treated as a special case. In this case, I think it should live only in apply_cholesky_lib_potrfBatched, and the problem is not in cuSOLVER's potrfBatched (it happily accepts batch size 0 and would error only on negative values) but in arange.
Either an early return could be added there, or the arange call could be modified to something like:
// cusolver batched kernels require the input to be a "device array of device pointers"
Tensor self_working_copy_array = at::arange(
    reinterpret_cast<int64_t>(self_data),
    reinterpret_cast<int64_t>(&self_data[std::max<int>(batch_size - 1, 0) * matrix_stride]) + 1,
    static_cast<int64_t>(std::max<int64_t>(matrix_stride, 1) * sizeof(scalar_t)), // step should be non-zero
    self_working_copy.options().dtype(at::kLong));
potrfBatched says batchSize < 1 is an error https://docs.nvidia.com/cuda/cusolver/index.html#cuSolverDN-lt-t-gt-batchpotrf
Short-circuiting on special cases also seems natural and readable.
> potrfBatched says batchSize < 1 is an error https://docs.nvidia.com/cuda/cusolver/index.html#cuSolverDN-lt-t-gt-batchpotrf
My bad, I mistakenly looked at potrsBatched, which only errors on batchSize < 0. Other functions that allow batchSize = 0 are gesvdjBatched, syevjBatched, trsmBatched, gelsBatched, and others from cuBLAS. So I'd say it's a bug that potrfBatched errors on batchSize < 1 😃
I still think it's better to keep the short-circuit closer to the source of the problem, and it should be put next to the potrfBatched call. If someone decides to use these lower-level calls instead of the "front-end" linalg_cholesky function to develop other functionality, then it puts the responsibility on that developer to know that this function doesn't work for the zero-batch case.
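For illustration, a minimal standalone sketch of such a short-circuit placed right next to the potrfBatched call; the function name and parameters here are assumptions, not the PR's code:

#include <cusolverDn.h>

// Hedged example: guard the empty-batch case right before cusolverDn<t>potrfBatched,
// which reports an error for batchSize < 1.
cusolverStatus_t potrf_batched_with_guard(
    cusolverDnHandle_t handle,
    float** d_matrix_ptrs,   // device array of device pointers, one per matrix
    int n,                   // each matrix is n x n
    int batch_size,
    int* d_infos) {          // device array of per-matrix status codes
  if (batch_size == 0) {
    // Nothing to factorize; skip the call instead of tripping the
    // batchSize < 1 error inside cuSOLVER.
    return CUSOLVER_STATUS_SUCCESS;
  }
  return cusolverDnSpotrfBatched(
      handle, CUBLAS_FILL_MODE_UPPER, n, d_matrix_ptrs, /*lda=*/n,
      d_infos, batch_size);
}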
This reverts commit bb6f472.
heitorschueroff left a comment
@xwang233 Thank you for implementing this. The PR is looking good for the most part. The failures are real, though: the linker can't find the symbol cusolverGetErrorMessage.
Besides that, there's one typo that I commented on. I'm also wondering if we should start using int32_t and int64_t more consistently; this PR uses a mix of both, which is something we've done traditionally.
Let me know when you want me to take another look.
Only arguments that are passed to cuSOLVER need to be cast to
Reviewers: this PR is ready to go.
Tensor _cholesky_helper_cuda(const Tensor& self, bool upper) {
#ifdef USE_CUSOLVER
  if (batchCount(self) == 1 || !use_magma_) {
Would you add a TODO comment here to review using cuSOLVER for batches once the batched cuSOLVER issue is addressed, and add a link to the tracking issue you created?
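One possible shape for that TODO, sketched against the snippet quoted above; the helper names _cholesky_helper_cuda_cusolver and _cholesky_helper_cuda_magma are assumptions and may not match the PR's exact identifiers:

Tensor _cholesky_helper_cuda(const Tensor& self, bool upper) {
#ifdef USE_CUSOLVER
  // TODO: use cusolver potrfBatched for batchCount > 1 once the batched
  // cusolver NaN-output issue is fixed; see the tracking issue linked in the PR.
  if (batchCount(self) == 1 || !use_magma_) {
    return _cholesky_helper_cuda_cusolver(self, upper);
  }
#endif
  return _cholesky_helper_cuda_magma(self, upper);
}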
auto workdata_host = host_allocator.allocate(worksize_host * batch_size);
void* workdata_host_ptr = workdata_host.get();

for (int64_t i = 0; i < batch_size; i++) {
Please add a comment here, too, explaining why this occurs in a loop (so people don't wonder why batched cuSOLVER isn't used)
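For reference, a standalone sketch of the kind of comment being requested, together with the loop it would sit on; this is example code under assumed names and parameters, not the PR's implementation:

#include <cusolverDn.h>

// potrfBatched currently has a NaN-output issue, so each matrix in the batch is
// factorized with the single-matrix potrf in a loop instead of one batched call.
void cholesky_potrf_loop(
    cusolverDnHandle_t handle,
    float* d_batch,            // batch of matrices, contiguous on the device
    int n,                     // each matrix is n x n
    int64_t matrix_stride,     // elements between consecutive matrices
    int64_t batch_size,
    float* d_work, int lwork,  // device workspace reused by every iteration
    int* d_infos) {            // one status value per matrix
  for (int64_t i = 0; i < batch_size; i++) {
    // One cusolverDnSpotrf call per matrix; all calls go through the same
    // handle and therefore run sequentially on the handle's stream.
    cusolverDnSpotrf(handle, CUBLAS_FILL_MODE_UPPER, n,
                     d_batch + i * matrix_stride, /*lda=*/n,
                     d_work, lwork, d_infos + i);
  }
}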
mruberry left a comment
@ngimel and I approve!
Ping me when you'd like this merged. We just have two requests for comments.
Great work, Xiao! It's awesome to see Cholesky become so much faster for batch size <= 1, and that there are no longer weird performance differences between no batching and a batch size of one.
Looking forward to the follow-up PR after cuSOLVER addresses their batching issues, too!
// allocate workspace storage
auto& device_allocator = *at::cuda::getCUDADeviceAllocator();
auto workdata_device = device_allocator.allocate(worksize_device * batch_size);
void* workdata_device_ptr = workdata_device.get();
Is this a device-synchronizing allocation? If it borrows from the allocated memory pool, would it be better to be more conservative with the memory and move this workspace allocation inside the batching loop?
I'm not sure if it is a device-sync allocation. If it is, it would be inevitable anyway. Wouldn't one allocation be better than n allocations in most cases?
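To make the trade-off concrete, here's a hedged paraphrase of the two allocation strategies under discussion; it reuses the allocator calls quoted above but is not the PR's actual code:

// (a) current approach: one allocation up front, sized for the whole batch and
//     reused by every loop iteration
auto workdata_device = device_allocator.allocate(worksize_device * batch_size);

// (b) more memory-conservative alternative: allocate per iteration, at the cost
//     of batch_size round-trips through the caching allocator
for (int64_t i = 0; i < batch_size; i++) {
  auto workdata_i = device_allocator.allocate(worksize_device);
  // ... run the single-matrix potrf for matrix i using workdata_i.get() ...
}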
facebook-github-bot left a comment
@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
… 11.3 (#57788)
Summary: This PR enables the usage of cusolver potrf batched as the backend of Cholesky decomposition (`torch.linalg.cholesky` and `torch.linalg.cholesky_ex`) when cuda version is greater than or equal to 11.3.
Benchmark available at https://github.com/xwang233/code-snippet/tree/master/linalg/cholesky-new. It is seen that cusolver potrf batched performs better than magma potrf batched in most cases.
## cholesky dispatch heuristics:
### before:
- batch size == 1: cusolver potrf
- batch size > 1: magma xpotrf batched
### after:
cuda >= 11.3:
- batch size == 1: cusolver potrf
- batch size > 1: cusolver potrf batched
cuda < 11.3 (not changed):
- batch size == 1: cusolver potrf
- batch size > 1: magma xpotrf batched
---
See also #42666 #47953 #53104 #53879
Pull Request resolved: #57788
Reviewed By: ngimel
Differential Revision: D28345530
Pulled By: mruberry
fbshipit-source-id: 3022cf73b2750e1953c0e00a9e8b093dfc551f61
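As an aside, a minimal sketch of how a CUDA-version gate like the one described in this commit message could be expressed; the macro check and constant name are assumptions, not the PR's actual code:

#include <cuda_runtime_api.h>  // defines CUDART_VERSION, e.g. 11030 for CUDA 11.3

// Prefer cusolver potrfBatched for batch size > 1 only on CUDA >= 11.3;
// otherwise keep dispatching batched inputs to magma xpotrf batched.
#if defined(CUDART_VERSION) && CUDART_VERSION >= 11030
constexpr bool kPreferCusolverPotrfBatched = true;
#else
constexpr bool kPreferCusolverPotrfBatched = false;
#endif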
…decomposition (pytorch#53104)
Summary: This PR adds cusolver potrf and potrfBatched to the backend of torch.cholesky and torch.linalg.cholesky.
Cholesky heuristics:
- Use cusolver potrf for batch_size 1
- Use magma_xpotrf_batched for batch_size >= 2
- if magma is not available, use loop of cusolver potrf for batch_size >= 2
cusolver potrf batched currently has some nan output issue, we will switch to cusolver potrf batched after it's fixed
See also pytorch#42666 pytorch#47953
Todo:
- [x] benchmark and heuristic
Close pytorch#53992
Pull Request resolved: pytorch#53104
Reviewed By: agolynski
Differential Revision: D27113963
Pulled By: mruberry
fbshipit-source-id: 1429f63891cfc6176f9d8fdeb5c3b0617d750803
This PR adds cusolver potrf and potrfBatched to the backend of torch.cholesky and torch.linalg.cholesky.
Cholesky heuristics:
- Use cusolver potrf for batch_size 1
- Use magma_xpotrf_batched for batch_size >= 2
- If magma is not available, use a loop of cusolver potrf for batch_size >= 2
cusolver potrf batched currently has some nan output issue; we will switch to cusolver potrf batched after it's fixed.
See also #42666 #47953
Todo:
- [x] benchmark and heuristic
Close #53992