SVD is slow on GPU vs CPU for skinny matrices #41306
Closed
Labels
module: linear algebra — Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply (matmul)
module: performance — Issues related to performance, either of kernel code or framework glue
triaged — This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
🐛 Bug
Performing SVD on the GPU is extremely slow, and as far as I know it is an open research question whether SVD in general can gain much from running on the GPU. I therefore propose executing it on the CPU by default.
To Reproduce
CPU:
Result:
CUDA:
Result:
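The original benchmark snippets did not survive extraction. A minimal sketch of a comparable timing harness is below; the matrix shapes and iteration count are illustrative, not the reporter's, and it uses the modern `torch.linalg.svd` API (the original report predates it and would have used `torch.svd`):

```python
import time

import torch


def time_svd(a, n_iter=10):
    """Average wall-clock seconds per SVD of `a` (hypothetical helper).

    CUDA kernels launch asynchronously, so we must synchronize
    before and after timing to measure the actual GPU work.
    """
    if a.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iter):
        torch.linalg.svd(a, full_matrices=False)
    if a.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iter


# A batch of skinny matrices: many rows, few columns (illustrative sizes).
a_cpu = torch.randn(64, 1000, 32)  # (batch, m, n) with m >> n
print("CPU: ", time_svd(a_cpu))
if torch.cuda.is_available():
    print("CUDA:", time_svd(a_cpu.cuda()))
```

Note the explicit `torch.cuda.synchronize()` calls: without them the loop only measures kernel launch overhead, not the SVD itself.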
Here is a figure illustrating the behavior depending on the batch size:

[figure: SVD runtime, CPU vs CUDA, as a function of batch_size]
Expected behavior
The CUDA result should be at least as fast as the CPU result; otherwise there is no point in using CUDA for this operation.
Environment
How you installed PyTorch (conda, pip, source): pip
Additional context
I am aware that I could manually move the tensor to the CPU before computing the SVD, but this has several drawbacks:
torch.slogdet uses SVD internally during the backward pass if the tensor is singular. This can only be avoided by implementing custom gradients or by performing slogdet on the CPU, despite slogdet being faster with CUDA.

cc @vincentqb @vishwakftw @ssnl @jianyuh @VitalyFedyunin @ngimel
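For reference, the manual workaround mentioned above can be sketched as follows: round-trip the tensor through the CPU for the SVD itself and move the factors back afterwards. The helper name `svd_via_cpu` is hypothetical, and as noted, this does not help when SVD is invoked internally (e.g. in slogdet's backward pass) and adds device-transfer overhead:

```python
import torch


def svd_via_cpu(a):
    """Hypothetical workaround: compute the SVD on the CPU and move the
    factors back to the input's original device (CPU or CUDA)."""
    u, s, vh = torch.linalg.svd(a.cpu(), full_matrices=False)
    return u.to(a.device), s.to(a.device), vh.to(a.device)


a = torch.randn(1000, 32)
if torch.cuda.is_available():
    a = a.cuda()
u, s, vh = svd_via_cpu(a)
# Sanity check: the factors reconstruct the input.
print(torch.allclose(a, u @ torch.diag(s) @ vh, atol=1e-3))
```

The extra `.cpu()`/`.to(a.device)` copies are exactly the kind of boilerplate a CPU-by-default dispatch would make unnecessary.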