Avoid CUDA sync in spectral decompositions #174601
Closed
Labels
bot-triaged · feature · module: cuda · module: linear algebra · module: performance · triaged
🚀 The feature, motivation and pitch
The current docs for spectral decompositions state that they require a device <--> CPU sync. I don't know whether the sync is the cause, but in practice I can compute spectral decompositions for a batch of matrices approximately 10x faster using CuPy interop via DLPack than with PyTorch, and that includes a custom autograd function that implements the backward pass for me (a 500x100x100 tensor, i.e. 500 matrices of size 100x100).
I assume that if CuPy does this quickly, NVIDIA already provides primitives to do it quickly. So I am asking, if it's possible, for PyTorch to use them internally as well.
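For reference, a minimal way to time the PyTorch path for the shape mentioned above. This is a sketch, not from the original report; `torch.cuda.Event` and `torch.cuda.synchronize` are the standard CUDA timing primitives, and the helper name `time_eigh` is illustrative:

```python
import torch

def time_eigh(batch=500, n=100, iters=10):
    # Build a symmetric batch so torch.linalg.eigh applies.
    a = torch.randn(batch, n, n, device="cuda")
    a = 0.5 * (a + a.transpose(-2, -1))
    torch.linalg.eigh(a)  # warm-up (cuSOLVER workspace allocation, JIT, etc.)

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.linalg.eigh(a)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call
```

The same harness can be pointed at a CuPy-based implementation to reproduce the comparison.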
Alternatives
Continue using CuPy/DLPack interop.
Additional context
I am researching trainable models whose nonlinear activations are matrix eigenvalues. Making these decompositions fast would significantly accelerate my experiments without requiring special CuPy hacks.
cc @jerryzh168 @ptrblck @msaroufim @eqy @tinglvv @nWEIdia @jianyuh @nikitaved @mruberry @walterddr @xwang233 @lezcano