
Batched SVD_LOWRANK being much slower than loop implementation (both CPU and GPU)  #56891

@hypnopump

Issue description

I've found torch.svd_lowrank to be up to 2x slower, on both CPU and GPU, when using the batched implementation compared to a loop implementation.

I suggest that the batched implementation could internally fall back to a loop so it is faster.
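For illustration, such a fallback could be a thin wrapper that loops over the leading batch dimension and stacks the per-matrix results. This is just a sketch, not PyTorch's actual implementation; the helper name `svd_lowrank_looped` is made up:

```python
import torch

def svd_lowrank_looped(A, q=6, niter=2):
    # Hypothetical helper: for a batched input, call torch.svd_lowrank
    # on each matrix separately and stack the results, mimicking the
    # output shapes of the batched call.
    if A.dim() == 2:
        return torch.svd_lowrank(A, q=q, niter=niter)
    us, ss, vs = zip(*(torch.svd_lowrank(a, q=q, niter=niter) for a in A))
    return torch.stack(us), torch.stack(ss), torch.stack(vs)
```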

Note: I have found the batched and loop implementations to be on par for small matrix sizes (n < 2000 for n x n matrices), but very different for large sizes (on both CPU and GPU).

Code example

import torch

p = torch.randn(7000, 3)
d = torch.cdist(p, p, p=2)
# optional: move to GPU
# d = d.to(torch.device("cuda:0"))

# loop implementation - faster
u, s, v = [], [], []
for i in range(5):
    u_, s_, v_ = torch.svd_lowrank(d)
    u.append(u_)
    s.append(s_)
    v.append(v_)
u = torch.stack(u, dim=0)
s = torch.stack(s, dim=0)
v = torch.stack(v, dim=0)

# batched implementation - 2x slower
u, s, v = torch.svd_lowrank(torch.stack([d] * 5, dim=0))
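For reference, the two paths above can be timed with a simple wall-clock benchmark. This is a sketch (the `bench` helper is ad hoc, and a smaller n is used here so it runs quickly); absolute numbers will vary by hardware:

```python
import time
import torch

def bench(fn, warmup=1, reps=3):
    # Simple wall-clock timer; for CUDA tensors, synchronize before
    # reading the clock so queued kernels are included in the timing.
    for _ in range(warmup):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / reps

n, b = 1000, 5  # smaller than the report's 7000x7000 case
p = torch.randn(n, 3)
d = torch.cdist(p, p, p=2)
batched = torch.stack([d] * b, dim=0)

t_loop = bench(lambda: [torch.svd_lowrank(d) for _ in range(b)])
t_batch = bench(lambda: torch.svd_lowrank(batched))
print(f"loop: {t_loop:.4f}s  batched: {t_batch:.4f}s")
```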

System Info

I ran the script both on a MacBook Pro (CPU build of torch) and on Colab (CUDA), with the same results.

cc @jianyuh @nikitaved @pearu @mruberry @heitorschueroff @walterddr @IvanYashchuk @xwang233 @lezcano @rgommers @VitalyFedyunin @ngimel


Labels

module: linear algebra: Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply (matmul)
module: performance: Issues related to performance, either of kernel code or framework glue
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
