Add cusolver potrf and potrfBatched to the backend of torch.cholesky decomposition #53104

Closed

xwang233 wants to merge 29 commits into pytorch:master from xwang233:cusolver-cholesky-1

Conversation

@xwang233
Collaborator

@xwang233 xwang233 commented Mar 2, 2021

This PR adds cusolver potrf and potrfBatched to the backend of torch.cholesky and torch.linalg.cholesky.

Cholesky heuristics:

  • Use cusolver potrf for batch_size == 1
  • Use magma_xpotrf_batched for batch_size >= 2
  • If MAGMA is not available, use a loop of cusolver potrf for batch_size >= 2

cusolver potrfBatched currently has a NaN output issue; we will switch to it once that is fixed. A minimal sketch of the dispatch above follows at the end of this description.

See also #42666 #47953

Todo:

  • benchmark and heuristic

Close #53992
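
A minimal sketch of this dispatch, for illustration only; batchCount and use_magma_ appear in the snippet quoted later in this thread, while the two helper names here are placeholders rather than the PR's actual functions:

  // Sketch only, not the PR's exact diff.
  Tensor _cholesky_helper_cuda(const Tensor& self, bool upper) {
  #ifdef USE_CUSOLVER
    if (batchCount(self) == 1 || !use_magma_) {
      // batch_size == 1, or MAGMA unavailable: cusolver potrf,
      // looped over the batch when batch_size >= 2.
      return cholesky_helper_cusolver(self, upper);  // placeholder name
    } else {
      // batch_size >= 2 with MAGMA available: magma_xpotrf_batched.
      return cholesky_helper_magma(self, upper);     // placeholder name
    }
  #else
    return cholesky_helper_magma(self, upper);
  #endif
  }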

@xwang233
Collaborator Author

xwang233 commented Mar 2, 2021

Benchmark

See https://github.com/xwang233/code-snippet/tree/master/linalg/cholesky for A100 and V100 (pending) results.

before: master commit 1.9.0a0+git68b6249

after: commit 1.9.0a0+gitccb6d48

The table below was measured on an E5-2680 v3 CPU and an RTX 2070 Super GPU, with intel-mkl 2020.2.254-1 and cuda 11.2.

  • before: magma
  • after: potrf (when batch_size == 1) + magma (when batch_size >= 2) [this PR]

time is in us (10^-6 s)

batch_shape n dtype cpu before_magma after_potrf64bit_and_magma
[] 2 torch.float32 66.774 2280.221 58.959
[] 4 torch.float32 14.175 2303.458 59.223
[] 8 torch.float32 18.633 2326.320 59.109
[] 16 torch.float32 20.404 2358.930 59.451
[] 32 torch.float32 23.551 2341.613 59.565
[] 64 torch.float32 50.053 2383.232 83.184
[] 128 torch.float32 173.814 2431.619 120.336
[] 256 torch.float32 437.671 2884.719 208.042
[] 512 torch.float32 1672.962 3219.271 400.172
[] 1024 torch.float32 6052.655 3948.443 972.376
[] 2048 torch.float32 41786.490 6239.588 2771.738
[1] 2 torch.float32 16.018 55.177 58.924
[1] 4 torch.float32 15.890 55.194 64.280
[1] 8 torch.float32 16.391 56.533 59.009
[1] 16 torch.float32 17.996 62.417 59.372
[1] 32 torch.float32 21.377 74.961 59.963
[1] 64 torch.float32 54.598 102.557 75.136
[1] 128 torch.float32 153.839 170.707 112.467
[1] 256 torch.float32 411.058 374.914 186.747
[1] 512 torch.float32 1176.441 1147.444 359.730
[1] 1024 torch.float32 4504.277 4157.900 972.515
[1] 2048 torch.float32 28452.310 6390.159 2773.273
[2] 2 torch.float32 17.597 54.883 39.668
[2] 4 torch.float32 18.118 55.135 40.283
[2] 8 torch.float32 19.303 56.541 41.505
[2] 16 torch.float32 22.752 61.926 47.370
[2] 32 torch.float32 28.764 74.519 59.570
[2] 64 torch.float32 71.314 101.504 86.426
[2] 128 torch.float32 292.647 168.731 154.935
[2] 256 torch.float32 552.208 369.279 354.285
[2] 512 torch.float32 1600.543 1204.858 1201.538
[2] 1024 torch.float32 5972.632 4448.121 4521.013
[2] 2048 torch.float32 39196.011 7395.678 7465.609
[4] 2 torch.float32 18.258 41.088 40.741
[4] 4 torch.float32 18.169 41.878 40.869
[4] 8 torch.float32 20.840 42.951 43.534
[4] 16 torch.float32 27.483 48.655 48.096
[4] 32 torch.float32 41.822 61.765 60.854
[4] 64 torch.float32 138.791 91.196 91.095
[4] 128 torch.float32 574.721 166.014 165.793
[4] 256 torch.float32 757.351 385.509 384.977
[4] 512 torch.float32 2342.804 1304.622 1304.067
[4] 1024 torch.float32 9071.045 4886.211 4929.194
[4] 2048 torch.float32 57826.840 9442.459 9631.141
[8] 2 torch.float32 17.141 41.769 42.027
[8] 4 torch.float32 16.745 43.375 41.145
[8] 8 torch.float32 19.215 44.121 42.852
[8] 16 torch.float32 27.425 50.869 48.796
[8] 32 torch.float32 46.709 64.079 62.668
[8] 64 torch.float32 174.225 96.937 95.206
[8] 128 torch.float32 762.877 176.674 175.077
[8] 256 torch.float32 1386.097 432.471 429.544
[8] 512 torch.float32 4195.243 1462.709 1457.487
[8] 1024 torch.float32 19305.512 5786.699 5868.911
[16] 2 torch.float32 19.890 42.434 45.466
[16] 4 torch.float32 20.660 43.252 43.327
[16] 8 torch.float32 26.371 44.373 46.226
[16] 16 torch.float32 48.751 51.100 50.815
[16] 32 torch.float32 94.477 64.965 63.677
[16] 64 torch.float32 377.310 98.290 96.308
[16] 128 torch.float32 1474.216 185.933 184.481
[16] 256 torch.float32 2405.352 492.626 490.029
[16] 512 torch.float32 7828.434 1754.295 1753.410
[32] 2 torch.float32 21.078 42.380 40.783
[32] 4 torch.float32 22.185 43.250 42.487
[32] 8 torch.float32 31.124 44.699 43.080
[32] 16 torch.float32 87.727 50.902 49.999
[32] 32 torch.float32 131.084 65.271 64.570
[32] 64 torch.float32 539.154 99.046 97.589
[32] 128 torch.float32 2753.116 214.874 215.273
[32] 256 torch.float32 4813.017 610.045 610.006
[64] 2 torch.float32 27.769 42.636 40.654
[64] 4 torch.float32 29.520 43.522 41.364
[64] 8 torch.float32 47.020 44.746 43.315
[64] 16 torch.float32 132.615 50.989 49.638
[64] 32 torch.float32 217.310 65.572 63.921
[64] 64 torch.float32 986.923 104.305 104.182
[64] 128 torch.float32 5390.051 271.441 271.572
[128] 2 torch.float32 50.669 42.892 42.474
[128] 4 torch.float32 60.421 43.840 41.955
[128] 8 torch.float32 96.448 44.832 42.927
[128] 16 torch.float32 219.459 51.180 50.108
[128] 32 torch.float32 364.438 66.161 64.927
[128] 64 torch.float32 1903.152 133.194 133.470
[256] 2 torch.float32 67.254 44.754 42.825
[256] 4 torch.float32 80.547 45.447 43.048
[256] 8 torch.float32 146.505 47.612 46.427
[256] 16 torch.float32 332.615 54.235 51.697
[256] 32 torch.float32 660.817 74.383 73.969
[512] 2 torch.float32 118.439 46.233 43.343
[512] 4 torch.float32 144.591 47.107 45.033
[512] 8 torch.float32 267.354 49.280 47.363
[512] 16 torch.float32 581.125 57.851 56.217
[1024] 2 torch.float32 218.353 48.189 45.818
[1024] 4 torch.float32 276.394 48.978 49.839
[1024] 8 torch.float32 451.058 53.205 50.737

@xwang233
Collaborator Author

xwang233 commented Mar 2, 2021

reserved 2

@facebook-github-bot
Contributor

facebook-github-bot commented Mar 2, 2021

💊 CI failures summary and remediations

As of commit d52f95d (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

@IvanYashchuk IvanYashchuk self-requested a review March 2, 2021 10:48
Comment on lines +887 to +889
if (self.numel() == 0) {
return at::empty_like(self, LEGACY_CONTIGUOUS_MEMORY_FORMAT);
}
Collaborator

Could you please remove this? cuSOLVER calls should handle this case correctly if the lda parameter is right. We don't need to add this early return ourselves.
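
For context, the usual LAPACK-style convention that makes the degenerate case safe, as a general sketch (not the PR's diff; cuda_int_cast is the ATen helper discussed later in this thread):

  // LAPACK/cuSOLVER require lda >= max(1, n), so clamp it even when
  // n == 0; with a valid lda, potrf on a 0x0 matrix is a well-defined no-op.
  int n = cuda_int_cast(self.size(-1), "n");
  int lda = std::max<int>(1, n);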

Collaborator Author

A zero matrix stride or zero batch size causes problems in the arange call below. Writing an if statement there would make no difference compared to an if statement here.

Besides that, if we can short-circuit the functionality at a much earlier point, why should we rely on the much deeper implementations of LAPACK, cusolver, or MAGMA to handle a zero batch/matrix size? We don't control those implementations, and they may burn us in the future.

Collaborator

A matrix of size 0x0 is not a special case and shouldn't be treated as such, and I don't think 0xNxN should be treated as special either.

why should we rely on the much deeper implementations of LAPACK, cusolver, or MAGMA to handle a zero batch/matrix size? We don't control those implementations, and they may burn us in the future.

The LAPACK API is a de facto standard, and I doubt that vendor implementations would deviate from reference LAPACK. Existing calls to LAPACK and MAGMA work without this early return, and cuSOLVER's potrf will work as well. It's better to keep this kind of handling closer to the source of the problem and leave a comment explaining why the special case is needed. Here, I think it should live only in apply_cholesky_lib_potrfBatched, and the problem is not in cuSOLVER's potrfBatched (it happily accepts batch size 0 and only errors on negative values) but in arange.
Either an early return could be added there, or the arange call could be modified to something like:

  // cusolver batched kernels require the input to be a "device array of device pointers"
  Tensor self_working_copy_array = at::arange(
    reinterpret_cast<int64_t>(self_data),
    reinterpret_cast<int64_t>(&self_data[std::max<int>(batch_size - 1, 0) * matrix_stride]) + 1,
    static_cast<int64_t>(std::max<int64_t>(matrix_stride, 1) * sizeof(scalar_t)), // step should be non-zero
    self_working_copy.options().dtype(at::kLong));
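
For reference, this pointer array is what the batched call then consumes; a sketch of the float case, assuming handle, n, lda, d_info, and batch_size are already set up:

  // self_working_copy_array holds one device address per matrix in the
  // batch; cusolverDn<t>potrfBatched takes them as a device array of pointers.
  auto d_Aarray = reinterpret_cast<float**>(self_working_copy_array.data_ptr<int64_t>());
  cusolverDnSpotrfBatched(handle, CUBLAS_FILL_MODE_LOWER, n,
                          d_Aarray, lda, d_info, batch_size);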

Collaborator Author

@mruberry What are your thoughts?

Collaborator

Short-circuiting on special cases also seems natural and readable.

Collaborator

potrfBatched says batchSize < 1 is an error https://docs.nvidia.com/cuda/cusolver/index.html#cuSolverDN-lt-t-gt-batchpotrf

My bad, I mistakenly looked at potrsBatched, which errors on batchSize < 0. Other functions that allow batchSize = 0 are gesvdjBatched, syevjBatched, trsmBatched, gelsBatched, and others from cuBLAS. So I'd say it's a bug in potrfBatched that it errors on batchSize < 1 😃

I still think it's better to keep the short-circuit closer to the source of the problem, next to the potrfBatched call. If someone decides to use these lower-level calls instead of the "front-end" linalg_cholesky function to develop other functionality, that puts the responsibility on that developer to know that this function doesn't work for the zero-batch case.
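
A minimal sketch of the guard placement argued for here, placed immediately before the batched call:

  // potrfBatched errors on batchSize < 1 (see the docs link above), so
  // short-circuit the empty batch right next to the call it protects.
  if (batch_size == 0) {
    return;
  }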

Comment thread aten/src/ATen/native/cuda/BatchLinearAlgebra.cu Outdated
Comment thread aten/src/ATen/native/cuda/BatchLinearAlgebraLib.cu Outdated
Comment thread aten/src/ATen/native/cuda/BatchLinearAlgebraLib.cu Outdated
@IvanYashchuk IvanYashchuk self-requested a review March 2, 2021 11:05
@heitorschueroff heitorschueroff self-requested a review March 3, 2021 16:08
@xwang233 xwang233 requested a review from mruberry March 3, 2021 22:09
Contributor

@heitorschueroff heitorschueroff left a comment

@xwang233 Thank you for implementing this. The PR is looking good for the most part. The failures are real, though: the linker can't find the symbol cusolverGetErrorMessage.

Besides that, there's one typo that I commented on. I'm also wondering if we should start using int32_t and int64_t more consistently; this PR uses a mix of both, which is something we've done traditionally.

Let me know when you want me to take another look.

Comment thread aten/src/ATen/native/cuda/BatchLinearAlgebraLib.cu Outdated
@IvanYashchuk
Collaborator

I'm also wondering if we should start using int32_t and int64_t more consistently; this PR uses a mix of both, which is something we've done traditionally.

int is used because that's what cuSOLVER requires. Starting from CUDA 11, there is a new 64-bit API available for some functions, including an int64_t version for Cholesky: https://docs.nvidia.com/cuda/cusolver/index.html#cusolverDnXpotrf
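
For reference, the call shape of that 64-bit API as documented for CUDA >= 11, in a standalone sketch (the variable names are assumptions):

  // Query workspace sizes, then factor one n x n float64 matrix in place.
  // Unlike the legacy API, Xpotrf takes int64_t sizes and also needs a
  // host-side workspace (cf. worksize_host in the snippets below).
  size_t ws_device = 0, ws_host = 0;
  cusolverDnXpotrf_bufferSize(handle, params, CUBLAS_FILL_MODE_LOWER,
                              n, CUDA_R_64F, d_A, lda,
                              CUDA_R_64F, &ws_device, &ws_host);
  // ... allocate ws_device bytes on the device and ws_host bytes on the host ...
  cusolverDnXpotrf(handle, params, CUBLAS_FILL_MODE_LOWER,
                   n, CUDA_R_64F, d_A, lda, CUDA_R_64F,
                   d_workspace, ws_device, h_workspace, ws_host, d_info);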

@IvanYashchuk
Collaborator

Only arguments that are passed to cuSOLVER need to be cast to int with cuda_int_cast. Everything else, for example batch_size in the loop of potrf calls, or matrix_stride, should be int64_t.
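
A short sketch of that convention (cuda_int_cast's exact signature is an assumption here; batchCount and matrixStride are ATen helpers):

  int64_t batch_size = batchCount(self);       // stays 64-bit: loop bound only
  int64_t matrix_stride = matrixStride(self);  // stays 64-bit: pointer arithmetic
  int n = cuda_int_cast(self.size(-1), "n");   // cast only what cuSOLVER sees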

@xwang233 xwang233 changed the title from "[WIP] Add cusolver potrf and potrfBatched to the backend of torch.cholesky decomposition" to "Add cusolver potrf and potrfBatched to the backend of torch.cholesky decomposition" Mar 13, 2021
@xwang233
Collaborator Author

Reviewers: this PR is ready to go.


Tensor _cholesky_helper_cuda(const Tensor& self, bool upper) {
#ifdef USE_CUSOLVER
if (batchCount(self) == 1 || !use_magma_) {
Collaborator

Would you add a TODO comment here to review using cuSOLVER for batches once the batched cuSOLVER issue is addressed, and add a link to the tracking issue you created?

auto workdata_host = host_allocator.allocate(worksize_host * batch_size);
void* workdata_host_ptr = workdata_host.get();

for (int64_t i = 0; i < batch_size; i++) {
Collaborator

Please add a comment here, too, explaining why this occurs in a loop (so people don't wonder why batched cuSOLVER isn't used)

@mruberry mruberry self-requested a review March 15, 2021 21:28
Collaborator

@mruberry mruberry left a comment

@ngimel and I approve!

Ping me when you'd like this merged. We just have two requests for comments.

Great work, Xiao! It's awesome to see Cholesky become so much faster for batch size <= 1, and there are no longer weird performance differences between no batching and a batch size of one.

Looking forward to the follow-up PR after cuSOLVER addresses their batching issues, too!

Comment on lines +414 to +417
// allocate workspace storage
auto& device_allocator = *at::cuda::getCUDADeviceAllocator();
auto workdata_device = device_allocator.allocate(worksize_device * batch_size);
void* workdata_device_ptr = workdata_device.get();
Collaborator

Is this a device-synchronizing allocation? If it borrows from the allocated memory pool, would it be better to be more conservative with memory and move this workspace allocation inside the batching loop?

Collaborator Author

I'm not sure whether it's a device-synchronizing allocation. If it is, it would be unavoidable. Wouldn't one allocation be better than n allocations in most cases?
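
For concreteness, the alternative under discussion would look roughly like this; a sketch that trades peak memory for batch_size allocator round-trips:

  // One worksize_device-sized buffer per iteration, so at most one
  // workspace is outstanding at a time.
  for (int64_t i = 0; i < batch_size; i++) {
    auto workdata_device = device_allocator.allocate(worksize_device);
    // ... run potrf on matrix i using workdata_device.get() ...
  }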

@xwang233
Collaborator Author

Thanks @mruberry, @ngimel. I've added the comments. The CI is green and this PR is ready to go.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@mruberry merged this pull request in 2f3b194.

facebook-github-bot pushed a commit that referenced this pull request May 11, 2021
… 11.3 (#57788)

Summary:
This PR enables cusolver potrf batched as the backend of Cholesky decomposition (`torch.linalg.cholesky` and `torch.linalg.cholesky_ex`) when the CUDA version is greater than or equal to 11.3.

Benchmark available at https://github.com/xwang233/code-snippet/tree/master/linalg/cholesky-new. cusolver potrf batched performs better than magma potrf batched in most cases.

## cholesky dispatch heuristics:

### before:

- batch size == 1: cusolver potrf
- batch size > 1: magma xpotrf batched

### after:

cuda >= 11.3:
- batch size == 1: cusolver potrf
- batch size > 1: cusolver potrf batched

cuda < 11.3 (not changed):
- batch size == 1: cusolver potrf
- batch size > 1: magma xpotrf batched

 ---

See also #42666 #47953 #53104 #53879

Pull Request resolved: #57788

Reviewed By: ngimel

Differential Revision: D28345530

Pulled By: mruberry

fbshipit-source-id: 3022cf73b2750e1953c0e00a9e8b093dfc551f61
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
… 11.3 (pytorch#57788)

laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…decomposition (pytorch#53104)

Pull Request resolved: pytorch#53104

Reviewed By: agolynski

Differential Revision: D27113963

Pulled By: mruberry

fbshipit-source-id: 1429f63891cfc6176f9d8fdeb5c3b0617d750803
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
… 11.3 (pytorch#57788)
