Add cusolver potrf and potrfBatched to the backend of torch.cholesky decomposition #53104

Closed

xwang233 wants to merge 29 commits into pytorch:master from xwang233:cusolver-cholesky-1

Conversation

@xwang233
Collaborator

@xwang233 xwang233 commented Mar 2, 2021

This PR adds cusolver potrf and potrfBatched to the backend of torch.cholesky and torch.linalg.cholesky.

Cholesky heuristics:

  • Use cusolver potrf for batch_size == 1
  • Use magma_xpotrf_batched for batch_size >= 2
  • If MAGMA is not available, use a loop of cusolver potrf for batch_size >= 2

cusolver potrfBatched currently has a NaN output issue; we will switch to it once that is fixed. A minimal sketch of the dispatch above follows at the end of this description.

See also #42666 #47953

Todo:

  • benchmark and heuristic

Close #53992
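
A minimal sketch of this dispatch, for illustration only; batchCount and use_magma_ appear in the snippet quoted later in this thread, while the two helper names here are placeholders rather than the PR's actual functions:

  // Sketch only, not the PR's exact diff.
  Tensor _cholesky_helper_cuda(const Tensor& self, bool upper) {
  #ifdef USE_CUSOLVER
    if (batchCount(self) == 1 || !use_magma_) {
      // batch_size == 1, or MAGMA unavailable: cusolver potrf,
      // looped over the batch when batch_size >= 2.
      return cholesky_helper_cusolver(self, upper);  // placeholder name
    } else {
      // batch_size >= 2 with MAGMA available: magma_xpotrf_batched.
      return cholesky_helper_magma(self, upper);     // placeholder name
    }
  #else
    return cholesky_helper_magma(self, upper);
  #endif
  }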

@xwang233
Collaborator Author

xwang233 commented Mar 2, 2021

Benchmark

See https://github.com/xwang233/code-snippet/tree/master/linalg/cholesky for A100 and V100 (pending) results.

before: master commit 1.9.0a0+git68b6249

after: commit 1.9.0a0+gitccb6d48

The table below was measured on an E5-2680 v3 CPU and an RTX 2070 Super GPU, with intel-mkl 2020.2.254-1 and cuda 11.2.

  • before: magma
  • after: potrf (when batch_size == 1) + magma (when batch_size >= 2) [this PR]

time is in us (10^-6 s)

batch_shape n dtype cpu before_magma after_potrf64bit_and_magma
[] 2 torch.float32 66.774 2280.221 58.959
[] 4 torch.float32 14.175 2303.458 59.223
[] 8 torch.float32 18.633 2326.320 59.109
[] 16 torch.float32 20.404 2358.930 59.451
[] 32 torch.float32 23.551 2341.613 59.565
[] 64 torch.float32 50.053 2383.232 83.184
[] 128 torch.float32 173.814 2431.619 120.336
[] 256 torch.float32 437.671 2884.719 208.042
[] 512 torch.float32 1672.962 3219.271 400.172
[] 1024 torch.float32 6052.655 3948.443 972.376
[] 2048 torch.float32 41786.490 6239.588 2771.738
[1] 2 torch.float32 16.018 55.177 58.924
[1] 4 torch.float32 15.890 55.194 64.280
[1] 8 torch.float32 16.391 56.533 59.009
[1] 16 torch.float32 17.996 62.417 59.372
[1] 32 torch.float32 21.377 74.961 59.963
[1] 64 torch.float32 54.598 102.557 75.136
[1] 128 torch.float32 153.839 170.707 112.467
[1] 256 torch.float32 411.058 374.914 186.747
[1] 512 torch.float32 1176.441 1147.444 359.730
[1] 1024 torch.float32 4504.277 4157.900 972.515
[1] 2048 torch.float32 28452.310 6390.159 2773.273
[2] 2 torch.float32 17.597 54.883 39.668
[2] 4 torch.float32 18.118 55.135 40.283
[2] 8 torch.float32 19.303 56.541 41.505
[2] 16 torch.float32 22.752 61.926 47.370
[2] 32 torch.float32 28.764 74.519 59.570
[2] 64 torch.float32 71.314 101.504 86.426
[2] 128 torch.float32 292.647 168.731 154.935
[2] 256 torch.float32 552.208 369.279 354.285
[2] 512 torch.float32 1600.543 1204.858 1201.538
[2] 1024 torch.float32 5972.632 4448.121 4521.013
[2] 2048 torch.float32 39196.011 7395.678 7465.609
[4] 2 torch.float32 18.258 41.088 40.741
[4] 4 torch.float32 18.169 41.878 40.869
[4] 8 torch.float32 20.840 42.951 43.534
[4] 16 torch.float32 27.483 48.655 48.096
[4] 32 torch.float32 41.822 61.765 60.854
[4] 64 torch.float32 138.791 91.196 91.095
[4] 128 torch.float32 574.721 166.014 165.793
[4] 256 torch.float32 757.351 385.509 384.977
[4] 512 torch.float32 2342.804 1304.622 1304.067
[4] 1024 torch.float32 9071.045 4886.211 4929.194
[4] 2048 torch.float32 57826.840 9442.459 9631.141
[8] 2 torch.float32 17.141 41.769 42.027
[8] 4 torch.float32 16.745 43.375 41.145
[8] 8 torch.float32 19.215 44.121 42.852
[8] 16 torch.float32 27.425 50.869 48.796
[8] 32 torch.float32 46.709 64.079 62.668
[8] 64 torch.float32 174.225 96.937 95.206
[8] 128 torch.float32 762.877 176.674 175.077
[8] 256 torch.float32 1386.097 432.471 429.544
[8] 512 torch.float32 4195.243 1462.709 1457.487
[8] 1024 torch.float32 19305.512 5786.699 5868.911
[16] 2 torch.float32 19.890 42.434 45.466
[16] 4 torch.float32 20.660 43.252 43.327
[16] 8 torch.float32 26.371 44.373 46.226
[16] 16 torch.float32 48.751 51.100 50.815
[16] 32 torch.float32 94.477 64.965 63.677
[16] 64 torch.float32 377.310 98.290 96.308
[16] 128 torch.float32 1474.216 185.933 184.481
[16] 256 torch.float32 2405.352 492.626 490.029
[16] 512 torch.float32 7828.434 1754.295 1753.410
[32] 2 torch.float32 21.078 42.380 40.783
[32] 4 torch.float32 22.185 43.250 42.487
[32] 8 torch.float32 31.124 44.699 43.080
[32] 16 torch.float32 87.727 50.902 49.999
[32] 32 torch.float32 131.084 65.271 64.570
[32] 64 torch.float32 539.154 99.046 97.589
[32] 128 torch.float32 2753.116 214.874 215.273
[32] 256 torch.float32 4813.017 610.045 610.006
[64] 2 torch.float32 27.769 42.636 40.654
[64] 4 torch.float32 29.520 43.522 41.364
[64] 8 torch.float32 47.020 44.746 43.315
[64] 16 torch.float32 132.615 50.989 49.638
[64] 32 torch.float32 217.310 65.572 63.921
[64] 64 torch.float32 986.923 104.305 104.182
[64] 128 torch.float32 5390.051 271.441 271.572
[128] 2 torch.float32 50.669 42.892 42.474
[128] 4 torch.float32 60.421 43.840 41.955
[128] 8 torch.float32 96.448 44.832 42.927
[128] 16 torch.float32 219.459 51.180 50.108
[128] 32 torch.float32 364.438 66.161 64.927
[128] 64 torch.float32 1903.152 133.194 133.470
[256] 2 torch.float32 67.254 44.754 42.825
[256] 4 torch.float32 80.547 45.447 43.048
[256] 8 torch.float32 146.505 47.612 46.427
[256] 16 torch.float32 332.615 54.235 51.697
[256] 32 torch.float32 660.817 74.383 73.969
[512] 2 torch.float32 118.439 46.233 43.343
[512] 4 torch.float32 144.591 47.107 45.033
[512] 8 torch.float32 267.354 49.280 47.363
[512] 16 torch.float32 581.125 57.851 56.217
[1024] 2 torch.float32 218.353 48.189 45.818
[1024] 4 torch.float32 276.394 48.978 49.839
[1024] 8 torch.float32 451.058 53.205 50.737

@xwang233
Collaborator Author

xwang233 commented Mar 2, 2021

reserved 2

@facebook-github-bot
Contributor

facebook-github-bot commented Mar 2, 2021

💊 CI failures summary and remediations

As of commit d52f95d (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

@IvanYashchuk IvanYashchuk self-requested a review March 2, 2021 10:48
Comment on lines +887 to +889
if (self.numel() == 0) {
return at::empty_like(self, LEGACY_CONTIGUOUS_MEMORY_FORMAT);
}
Collaborator

Could you please remove this? cuSOLVER calls should handle this case correctly if the lda parameter is right. We don't need to add this early return ourselves.
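
For context, the usual LAPACK-style convention that makes the degenerate case safe, as a general sketch (not the PR's diff; cuda_int_cast is the ATen helper discussed later in this thread):

  // LAPACK/cuSOLVER require lda >= max(1, n), so clamp it even when
  // n == 0; with a valid lda, potrf on a 0x0 matrix is a well-defined no-op.
  int n = cuda_int_cast(self.size(-1), "n");
  int lda = std::max<int>(1, n);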

Collaborator Author

A zero matrix stride or zero batch size causes problems in the arange call below. Writing an if statement there would make no difference compared to an if statement here.

Besides that, if we can short-circuit the functionality at a much earlier point, why should we rely on the much deeper implementations of LAPACK, cusolver, or MAGMA to handle a zero batch/matrix size? We don't control those implementations, and they may burn us in the future.

Collaborator

A matrix of size 0x0 is not a special case and shouldn't be treated as such, and I don't think 0xNxN should be treated as special either.

why should we rely on the much deeper implementations of LAPACK, cusolver, or MAGMA to handle a zero batch/matrix size? We don't control those implementations, and they may burn us in the future.

The LAPACK API is a de facto standard, and I doubt that vendor implementations would deviate from reference LAPACK. Existing calls to LAPACK and MAGMA work without this early return, and cuSOLVER's potrf will work as well. It's better to keep this kind of handling closer to the source of the problem and leave a comment explaining why the special case is needed. Here, I think it should live only in apply_cholesky_lib_potrfBatched, and the problem is not in cuSOLVER's potrfBatched (it happily accepts batch size 0 and only errors on negative values) but in arange.
Either an early return could be added there, or the arange call could be modified to something like:

  // cusolver batched kernels require the input to be a "device array of device pointers"
  Tensor self_working_copy_array = at::arange(
    reinterpret_cast<int64_t>(self_data),
    reinterpret_cast<int64_t>(&self_data[std::max<int>(batch_size - 1, 0) * matrix_stride]) + 1,
    static_cast<int64_t>(std::max<int64_t>(matrix_stride, 1) * sizeof(scalar_t)), // step should be non-zero
    self_working_copy.options().dtype(at::kLong));
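
For reference, this pointer array is what the batched call then consumes; a sketch of the float case, assuming handle, n, lda, d_info, and batch_size are already set up:

  // self_working_copy_array holds one device address per matrix in the
  // batch; cusolverDn<t>potrfBatched takes them as a device array of pointers.
  auto d_Aarray = reinterpret_cast<float**>(self_working_copy_array.data_ptr<int64_t>());
  cusolverDnSpotrfBatched(handle, CUBLAS_FILL_MODE_LOWER, n,
                          d_Aarray, lda, d_info, batch_size);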

Collaborator Author

@mruberry What are your thoughts?

Collaborator

Short-circuiting on special cases also seems natural and readable.

Collaborator

potrfBatched says batchSize < 1 is an error https://docs.nvidia.com/cuda/cusolver/index.html#cuSolverDN-lt-t-gt-batchpotrf

My bad, I mistakenly looked at potrsBatched, which errors on batchSize < 0. Other functions that allow batchSize = 0 are gesvdjBatched, syevjBatched, trsmBatched, gelsBatched, and others from cuBLAS. So I'd say it's a bug in potrfBatched that it errors on batchSize < 1 😃

I still think it's better to keep the short-circuit closer to the source of the problem, next to the potrfBatched call. If someone decides to use these lower-level calls instead of the "front-end" linalg_cholesky function to develop other functionality, that puts the responsibility on that developer to know that this function doesn't work for the zero-batch case.
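
A minimal sketch of the guard placement argued for here, placed immediately before the batched call:

  // potrfBatched errors on batchSize < 1 (see the docs link above), so
  // short-circuit the empty batch right next to the call it protects.
  if (batch_size == 0) {
    return;
  }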

Comment thread aten/src/ATen/native/cuda/BatchLinearAlgebra.cu Outdated
Comment thread aten/src/ATen/native/cuda/BatchLinearAlgebraLib.cu Outdated
Comment thread aten/src/ATen/native/cuda/BatchLinearAlgebraLib.cu Outdated
@IvanYashchuk IvanYashchuk self-requested a review March 2, 2021 11:05
@heitorschueroff heitorschueroff self-requested a review March 3, 2021 16:08
@xwang233 xwang233 requested a review from mruberry March 3, 2021 22:09
Contributor

@heitorschueroff heitorschueroff left a comment

@xwang233 Thank you for implementing this. The PR is looking good for the most part. The failures are real, though: the linker can't find the symbol cusolverGetErrorMessage.

Besides that, there's one typo that I commented on. I'm also wondering if we should start using int32_t and int64_t more consistently; this PR uses a mix of both, which is something we've done traditionally.

Let me know when you want me to take another look.

Comment thread aten/src/ATen/native/cuda/BatchLinearAlgebraLib.cu Outdated
@IvanYashchuk
Collaborator

I'm also wondering if we should start using int32_t and int64_t more consistently; this PR uses a mix of both, which is something we've done traditionally.

int is used because that's what cuSOLVER requires. Starting from CUDA 11, there is a new 64-bit API available for some functions, including an int64_t version for Cholesky: https://docs.nvidia.com/cuda/cusolver/index.html#cusolverDnXpotrf
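
For reference, the call shape of that 64-bit API as documented for CUDA >= 11, in a standalone sketch (the variable names are assumptions):

  // Query workspace sizes, then factor one n x n float64 matrix in place.
  // Unlike the legacy API, Xpotrf takes int64_t sizes and also needs a
  // host-side workspace (cf. worksize_host in the snippets below).
  size_t ws_device = 0, ws_host = 0;
  cusolverDnXpotrf_bufferSize(handle, params, CUBLAS_FILL_MODE_LOWER,
                              n, CUDA_R_64F, d_A, lda,
                              CUDA_R_64F, &ws_device, &ws_host);
  // ... allocate ws_device bytes on the device and ws_host bytes on the host ...
  cusolverDnXpotrf(handle, params, CUBLAS_FILL_MODE_LOWER,
                   n, CUDA_R_64F, d_A, lda, CUDA_R_64F,
                   d_workspace, ws_device, h_workspace, ws_host, d_info);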

@IvanYashchuk
Collaborator

Only arguments that are passed to cuSOLVER need to be cast to int with cuda_int_cast. Everything else, for example batch_size in the loop of potrf calls, or matrix_stride, should be int64_t.
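
A short sketch of that convention (cuda_int_cast's exact signature is an assumption here; batchCount and matrixStride are ATen helpers):

  int64_t batch_size = batchCount(self);       // stays 64-bit: loop bound only
  int64_t matrix_stride = matrixStride(self);  // stays 64-bit: pointer arithmetic
  int n = cuda_int_cast(self.size(-1), "n");   // cast only what cuSOLVER sees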

@xwang233 xwang233 changed the title from "[WIP] Add cusolver potrf and potrfBatched to the backend of torch.cholesky decomposition" to "Add cusolver potrf and potrfBatched to the backend of torch.cholesky decomposition" Mar 13, 2021
@xwang233
Collaborator Author

Reviewers: this PR is ready to go.


Tensor _cholesky_helper_cuda(const Tensor& self, bool upper) {
#ifdef USE_CUSOLVER
if (batchCount(self) == 1 || !use_magma_) {
Collaborator

Would you add a TODO comment here to review using cuSOLVER for batches once the batched cuSOLVER issue is addressed, and add a link to the tracking issue you created?

auto workdata_host = host_allocator.allocate(worksize_host * batch_size);
void* workdata_host_ptr = workdata_host.get();

for (int64_t i = 0; i < batch_size; i++) {
Collaborator

Please add a comment here, too, explaining why this occurs in a loop (so people don't wonder why batched cuSOLVER isn't used)

@mruberry mruberry self-requested a review March 15, 2021 21:28
Collaborator

@mruberry mruberry left a comment

@ngimel and I approve!

Ping me when you'd like this merged. We just have two requests for comments.

Great work, Xiao! It's awesome to see Cholesky become so much faster for batch size <= 1, and there are no longer weird performance differences between no batching and a batch size of one.

Looking forward to the follow-up PR after cuSOLVER addresses their batching issues, too!

Comment on lines +414 to +417
// allocate workspace storage
auto& device_allocator = *at::cuda::getCUDADeviceAllocator();
auto workdata_device = device_allocator.allocate(worksize_device * batch_size);
void* workdata_device_ptr = workdata_device.get();
Collaborator

Is this a device-synchronizing allocation? If it borrows from the allocated memory pool, would it be better to be more conservative with memory and move this workspace allocation inside the batching loop?

Collaborator Author

I'm not sure whether it's a device-synchronizing allocation. If it is, it would be unavoidable. Wouldn't one allocation be better than n allocations in most cases?
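
For concreteness, the alternative under discussion would look roughly like this; a sketch that trades peak memory for batch_size allocator round-trips:

  // One worksize_device-sized buffer per iteration, so at most one
  // workspace is outstanding at a time.
  for (int64_t i = 0; i < batch_size; i++) {
    auto workdata_device = device_allocator.allocate(worksize_device);
    // ... run potrf on matrix i using workdata_device.get() ...
  }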

@xwang233
Collaborator Author

Thanks @mruberry, @ngimel. I've added the comments. The CI is green and this PR is ready to go.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@mruberry merged this pull request in 2f3b194.

facebook-github-bot pushed a commit that referenced this pull request May 11, 2021
… 11.3 (#57788)

Summary:
This PR enables cusolver potrf batched as the backend of Cholesky decomposition (`torch.linalg.cholesky` and `torch.linalg.cholesky_ex`) when the CUDA version is greater than or equal to 11.3.

Benchmark available at https://github.com/xwang233/code-snippet/tree/master/linalg/cholesky-new. cusolver potrf batched performs better than magma potrf batched in most cases.

## cholesky dispatch heuristics:

### before:

- batch size == 1: cusolver potrf
- batch size > 1: magma xpotrf batched

### after:

cuda >= 11.3:
- batch size == 1: cusolver potrf
- batch size > 1: cusolver potrf batched

cuda < 11.3 (not changed):
- batch size == 1: cusolver potrf
- batch size > 1: magma xpotrf batched

 ---

See also #42666 #47953 #53104 #53879

Pull Request resolved: #57788

Reviewed By: ngimel

Differential Revision: D28345530

Pulled By: mruberry

fbshipit-source-id: 3022cf73b2750e1953c0e00a9e8b093dfc551f61
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
… 11.3 (pytorch#57788)

laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…decomposition (pytorch#53104)

Pull Request resolved: pytorch#53104

Reviewed By: agolynski

Differential Revision: D27113963

Pulled By: mruberry

fbshipit-source-id: 1429f63891cfc6176f9d8fdeb5c3b0617d750803
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
… 11.3 (pytorch#57788)
