Port CUDA torch.geqrf to ATen #56251

IvanYashchuk wants to merge 8 commits into gh/ivanyashchuk/12/base
Conversation
This PR ports `torch.geqrf` from TH to ATen for the CUDA path. Resolves #24569
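For context (not part of the PR diff): `geqrf` returns the QR factorization in LAPACK's compact form, with R in the upper triangle of the output matrix, the Householder reflector vectors stored below the diagonal, and the scalar factors in `tau`. A minimal unblocked NumPy sketch of that convention (illustrative only; the PR's actual code path goes through MAGMA):

```python
import numpy as np

def geqrf_sketch(a):
    """Unblocked Householder QR in LAPACK's compact (a, tau) layout."""
    a = np.array(a, dtype=float)
    m, n = a.shape
    tau = np.zeros(min(m, n))
    for k in range(min(m, n)):
        x = a[k:, k]
        normx = np.linalg.norm(x)
        if normx == 0.0:
            continue
        beta = -np.copysign(normx, x[0])  # sign choice avoids cancellation
        tau[k] = (beta - x[0]) / beta
        v = np.concatenate(([1.0], x[1:] / (x[0] - beta)))  # v[0] == 1
        # Apply the reflector H = I - tau * v v^T to the trailing submatrix.
        a[k:, k:] -= tau[k] * np.outer(v, v @ a[k:, k:])
        a[k, k] = beta
        a[k + 1:, k] = v[1:]  # store the reflector below the diagonal
    return a, tau

def form_q(a, tau):
    """Accumulate the orthogonal Q from the stored reflectors (what orgqr does)."""
    m = a.shape[0]
    q = np.eye(m)
    for k in range(len(tau)):
        v = np.zeros(m)
        v[k] = 1.0
        v[k + 1:] = a[k + 1:, k]
        q = q @ (np.eye(m) - tau[k] * np.outer(v, v))
    return q
```

Reconstructing `Q` via `form_q` and taking `np.triu` of the first output recovers the usual `A = QR`.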
💊 CI failures summary: as of commit 2843955, 💚 looks good so far! There are no failures yet. (Automatically generated by Dr. CI.)
```diff
 dispatch:
-  CPU: geqrf
-  CUDA: legacy::cuda::_th_geqrf
+  CPU, CUDA: geqrf
```
You don't have to do it in this PR, but consider making this structured! https://github.com/pytorch/rfcs/blob/rfc-0005/RFC-0005-structured-kernel-definitions.md
```cpp
// magmaGeqrf uses a hybrid CPU-GPU algorithm to compute the elementary reflectors.
// The driver routine geqrf2_gpu accepts a tensor on the CPU for elementary reflectors.
Tensor tau_cpu = at::empty(tau.sizes(), tau.options().device(at::kCPU));
```
Does it make sense to allocate this in pinned memory and use a non-blocking copy below? The effect is probably negligible, but I don't know off the top of my head.
I think it wouldn't hurt.
Is it simply `Tensor tau_cpu = at::empty(tau.sizes(), tau.options().device(at::kCPU).pinned_memory(true));`?
Yes, and `/*non_blocking=*/true` for the `copy_` below, iirc.
Thank you! It did give a non-negligible speedup:
| input shape | before | after |
|------------------------------|--------|-------|
| torch.Size([2, 2]) | 3.0 | 2.3 |
| torch.Size([2, 2, 2]) | 5.8 | 4.4 |
| torch.Size([32, 2, 2]) | 88.3 | 70.6 |
| torch.Size([8, 8]) | 2.9 | 2.2 |
| torch.Size([2, 8, 8]) | 5.7 | 4.4 |
| torch.Size([32, 8, 8]) | 88.3 | 70.7 |
| torch.Size([16, 16]) | 3.0 | 2.2 |
| torch.Size([2, 16, 16]) | 5.7 | 4.5 |
| torch.Size([32, 16, 16]) | 89.0 | 72.4 |
| torch.Size([32, 32]) | 3.0 | 2.3 |
| torch.Size([2, 32, 32]) | 5.8 | 4.6 |
| torch.Size([32, 32, 32]) | 91.6 | 72.7 |
| torch.Size([64, 64]) | 3.1 | 2.4 |
| torch.Size([2, 64, 64]) | 6.0 | 4.7 |
| torch.Size([32, 64, 64]) | 93.1 | 74.0 |
| torch.Size([128, 128]) | 3.5 | 2.8 |
| torch.Size([2, 128, 128]) | 6.9 | 5.5 |
| torch.Size([32, 128, 128]) | 107.8 | 88.1 |
| torch.Size([256, 256]) | 4.7 | 3.7 |
| torch.Size([2, 256, 256]) | 9.2 | 7.2 |
| torch.Size([32, 256, 256]) | 144.1 | 112.1 |
| torch.Size([512, 512]) | 6.9 | 6.0 |
| torch.Size([2, 512, 512]) | 13.3 | 11.8 |
| torch.Size([32, 512, 512]) | 209.3 | 184.6 |
| torch.Size([1024, 1024]) | 15.4 | 14.6 |
| torch.Size([2, 1024, 1024]) | 30.5 | 28.9 |
| torch.Size([32, 1024, 1024]) | 486.5 | 462.4 |
Times are in milliseconds (ms).
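The benchmark script itself isn't shown in the thread; a sketch of how such before/after numbers could be collected with `torch.utils.benchmark` (which handles CUDA synchronization and warm-up; the shapes and the CPU fallback device here are illustrative, not the setup actually used above):

```python
import torch
from torch.utils.benchmark import Timer

def time_geqrf(shape, device=None):
    """Return the median time in seconds per torch.geqrf call on a random input."""
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    a = torch.randn(*shape, device=device)
    # Timer synchronizes the device around each measured run, so async CUDA
    # kernel launches are accounted for in the reported time.
    timer = Timer(stmt="torch.geqrf(a)", globals={"torch": torch, "a": a})
    return timer.timeit(20).median
```

Looping `time_geqrf` over the listed shapes on the old and new builds would produce a table like the one above.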
I just hope magma synchronizes when it writes tau; otherwise the non-blocking copy might produce a wrong result (it would copy tau to the GPU before it's properly populated).
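The pattern under discussion, sketched with the PyTorch Python API rather than the ATen C++ in the diff (the function name and the stand-in for MAGMA's write are illustrative):

```python
import torch

def copy_tau_to_device(tau_gpu):
    """Sketch: stage a host-side tau buffer, then copy it to the device tensor."""
    pin = tau_gpu.is_cuda
    # Pinned (page-locked) host memory makes the device copy eligible for an
    # asynchronous DMA transfer; with pageable memory, copy_ with
    # non_blocking=True silently falls back to a synchronous copy.
    tau_cpu = torch.empty(tau_gpu.shape, dtype=tau_gpu.dtype, pin_memory=pin)
    tau_cpu.uniform_()  # stand-in for MAGMA's geqrf2_gpu writing tau on the host
    # Safe only if the producer (MAGMA here) has finished writing tau_cpu,
    # which is the synchronization concern raised above.
    tau_gpu.copy_(tau_cpu, non_blocking=True)
    return tau_cpu
```

On a CPU-only build `pin` is false and both copies are ordinary synchronous ones, so the sketch degrades gracefully.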
Hey @IvanYashchuk, unfortunately the internal tools are unable to merge this; would you try rebasing the stack now that the first 2 PRs are in?
@mruberry, I rebased the stack.
Summary:
Pull Request resolved: pytorch#56251

This PR ports `torch.geqrf` from TH to ATen for the CUDA path. Resolves pytorch#24569

Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D27960155
Pulled By: mruberry
fbshipit-source-id: a8b010c41d703a5de4bf40b045c89e6b95b5a5ca
Stack from ghstack:

This PR ports `torch.geqrf` from TH to ATen for the CUDA path. Resolves #24569

Differential Revision: D27960155