Avoid CUDA sync in spectral decompositions #174601
Closed
Labels
bot-triaged · feature · module: cuda · module: linear algebra · module: performance · triaged
🚀 The feature, motivation and pitch
The current docs for spectral decompositions state that they require a device <--> CPU sync. I don't know whether the sync is the cause, but in practice I can compute spectral decompositions for a batch of matrices approximately 10x faster using CuPy interop via DLPack than with PyTorch, and that includes a custom autograd function that implements the backward pass for me (a 500x100x100 tensor, i.e. 500 matrices of size 100x100).
I assume that if CuPy does this quickly, NVIDIA already provides primitives to do it quickly. So I am asking, if it's possible, for PyTorch to use them internally as well.
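For reference, a minimal way to time the PyTorch path for the shape mentioned above. This is a sketch, not from the original report; `torch.cuda.Event` and `torch.cuda.synchronize` are the standard CUDA timing primitives, and the helper name `time_eigh` is illustrative:

```python
import torch

def time_eigh(batch=500, n=100, iters=10):
    # Build a symmetric batch so torch.linalg.eigh applies.
    a = torch.randn(batch, n, n, device="cuda")
    a = 0.5 * (a + a.transpose(-2, -1))
    torch.linalg.eigh(a)  # warm-up (cuSOLVER workspace allocation, JIT, etc.)

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.linalg.eigh(a)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call
```

The same harness can be pointed at a CuPy-based implementation to reproduce the comparison.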
Alternatives
Continue using CuPy/DLPack interop.
Additional context
I am researching trainable models whose nonlinear activations are matrix eigenvalues. Making these decompositions fast would significantly accelerate my experiments without requiring special CuPy hacks.
cc @jerryzh168 @ptrblck @msaroufim @eqy @tinglvv @nWEIdia @jianyuh @nikitaved @mruberry @walterddr @xwang233 @lezcano