
Batched SVD_LOWRANK being much slower than loop implementation (both CPU and GPU)  #56891

@hypnopump

Issue description

I've found torch.svd_lowrank to be up to 2x slower, on both CPU and GPU, when using the batched implementation compared to a loop implementation.

I suggest that the batched implementation could internally fall back to a loop so it is faster.
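For illustration, such a fallback could be a thin wrapper that loops over the leading batch dimension and stacks the per-matrix results. This is just a sketch, not PyTorch's actual implementation; the helper name `svd_lowrank_looped` is made up:

```python
import torch

def svd_lowrank_looped(A, q=6, niter=2):
    # Hypothetical helper: for a batched input, call torch.svd_lowrank
    # on each matrix separately and stack the results, mimicking the
    # output shapes of the batched call.
    if A.dim() == 2:
        return torch.svd_lowrank(A, q=q, niter=niter)
    us, ss, vs = zip(*(torch.svd_lowrank(a, q=q, niter=niter) for a in A))
    return torch.stack(us), torch.stack(ss), torch.stack(vs)
```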

Note: I have found the batched and loop implementations to be on par for small matrix sizes (n < 2000 for n x n matrices), but very different for large sizes (on both CPU and GPU).

Code example

import torch

p = torch.randn(7000, 3)
d = torch.cdist(p, p, p=2)
# optional: move to GPU
# d = d.to(torch.device("cuda:0"))

# loop implementation - faster
u, s, v = [], [], []
for i in range(5):
    u_, s_, v_ = torch.svd_lowrank(d)
    u.append(u_)
    s.append(s_)
    v.append(v_)
u = torch.stack(u, dim=0)
s = torch.stack(s, dim=0)
v = torch.stack(v, dim=0)

# batched implementation - 2x slower
u, s, v = torch.svd_lowrank(torch.stack([d] * 5, dim=0))
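For reference, the two paths above can be timed with a simple wall-clock benchmark. This is a sketch (the `bench` helper is ad hoc, and a smaller n is used here so it runs quickly); absolute numbers will vary by hardware:

```python
import time
import torch

def bench(fn, warmup=1, reps=3):
    # Simple wall-clock timer; for CUDA tensors, synchronize before
    # reading the clock so queued kernels are included in the timing.
    for _ in range(warmup):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / reps

n, b = 1000, 5  # smaller than the report's 7000x7000 case
p = torch.randn(n, 3)
d = torch.cdist(p, p, p=2)
batched = torch.stack([d] * b, dim=0)

t_loop = bench(lambda: [torch.svd_lowrank(d) for _ in range(b)])
t_batch = bench(lambda: torch.svd_lowrank(batched))
print(f"loop: {t_loop:.4f}s  batched: {t_batch:.4f}s")
```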

System Info

I ran the script both on a MacBook Pro (CPU build of torch) and on Colab (CUDA), with the same results.

cc @jianyuh @nikitaved @pearu @mruberry @heitorschueroff @walterddr @IvanYashchuk @xwang233 @lezcano @rgommers @VitalyFedyunin @ngimel


Labels

module: linear algebra: Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply (matmul)
module: performance: Issues related to performance, either of kernel code or framework glue
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
