
Avoid CUDA sync in spectral decompositions #174601

@alexshtf

Description


🚀 The feature, motivation and pitch

The current docs for the spectral decomposition routines say that they require a device <--> CPU sync. I don't know whether the sync is the cause, but in practice I can compute spectral decompositions for a batch of matrices roughly 10x faster via CuPy interop over DLPack than with PyTorch - and that includes a custom autograd Function that implements the backward pass for me (a 500x100x100 tensor, i.e. 500 matrices of size 100x100).

I assume that if CuPy does this quickly, NVIDIA already provides primitives for doing it quickly. So I am asking whether it's possible to use those primitives internally in PyTorch as well.

Alternatives

Continue using CuPy/DLPack interop.
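For reference, the workaround can be sketched as a custom `torch.autograd.Function`. This is a minimal sketch, not the author's actual code: `torch.linalg.eigh` stands in for the CuPy call so it runs without CuPy (the DLPack interop path is shown in comments), and the backward formula assumes distinct eigenvalues.

```python
import torch


class FastEigh(torch.autograd.Function):
    """Batched symmetric eigendecomposition with a hand-written backward.

    forward() can delegate to any backend. With CuPy the zero-copy
    DLPack path would look like (hypothetical, requires CuPy):
        A_cp = cupy.from_dlpack(A.detach())
        w_cp, v_cp = cupy.linalg.eigh(A_cp)
        w, v = torch.from_dlpack(w_cp), torch.from_dlpack(v_cp)
    Here torch.linalg.eigh stands in so the sketch is self-contained.
    """

    @staticmethod
    def forward(ctx, A):
        # eigh only sees the symmetric part of A.
        A_sym = 0.5 * (A + A.transpose(-1, -2))
        w, v = torch.linalg.eigh(A_sym)
        ctx.save_for_backward(w, v)
        return w, v

    @staticmethod
    def backward(ctx, grad_w, grad_v):
        w, v = ctx.saved_tensors
        gw = grad_w if grad_w is not None else torch.zeros_like(w)
        gv = grad_v if grad_v is not None else torch.zeros_like(v)
        # Standard eigh backward (assumes distinct eigenvalues):
        #   dA = V (diag(gw) + F * (V^T gv)) V^T,  F_ij = 1/(w_j - w_i) off-diagonal.
        diff = w.unsqueeze(-2) - w.unsqueeze(-1)          # diff[..., i, j] = w_j - w_i
        F = torch.where(diff.abs() > 1e-10, diff.reciprocal(), torch.zeros_like(diff))
        inner = F * (v.transpose(-1, -2) @ gv) + torch.diag_embed(gw)
        grad_A = v @ inner @ v.transpose(-1, -2)
        # Symmetrize: only the symmetric part of A contributes to the forward.
        return 0.5 * (grad_A + grad_A.transpose(-1, -2))
```

Swapping the `torch.linalg.eigh` call in `forward` for the CuPy version is where the reported ~10x speedup comes from; the backward pass stays in pure PyTorch either way.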

Additional context

I am researching trainable models whose nonlinear activations are matrix eigenvalues. Making this fast would significantly accelerate my experiments without any special CuPy hacks.

cc @jerryzh168 @ptrblck @msaroufim @eqy @tinglvv @nWEIdia @jianyuh @nikitaved @mruberry @walterddr @xwang233 @lezcano


    Labels

    bot-triaged · feature · module: cuda · module: linear algebra · module: performance · triaged
