Micro-optimisations for matmul #64387
Conversation
A number of these include:

- Always prefer `DimVector` over `std::vector` when handling dimensions.
- Make the code `const` correct.
- Create `DimVector`s more efficiently (e.g. prefer `append` over `insert`).
- Access the sizes of the tensors via `sizes().front()` / `sizes().back()` / `sizes().end()[-2]`.
- Do not create intermediate tensors / vectors when it can be avoided.

This also includes an optimisation that dispatches vector–matrix products to `mv` rather than `mm`. Further optimisations could include dispatching to lower-level functions when one of the inputs has shape `(n, 1)` or `(1, n)`. [ghstack-poisoned]
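As a plain-Python illustration of the `mv` dispatch (lists only, no ATen — the `mv`, `mm` and `transpose` helpers below are made up for the sketch, not the ATen kernels): a vector–matrix product `(n,) @ (n, m)` can be computed either by unsqueezing the vector to `(1, n)` and calling `mm`, or as a single `mv` on the transposed matrix, which avoids the extra view/reshape bookkeeping:

```python
def mv(mat, vec):
    # (p, q) @ (q,) -> (p,)
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]

def mm(a, b):
    # (p, q) @ (q, r) -> (p, r)
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

vec = [1.0, 2.0, 3.0]
mat = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Route 1: unsqueeze the vector to (1, n), call mm, squeeze the result.
via_mm = mm([vec], mat)[0]
# Route 2: one mv call on the transposed matrix -- the cheaper dispatch.
via_mv = mv(transpose(mat), vec)
assert via_mm == via_mv  # [22.0, 28.0]
```

Both routes compute the same product; the point of the dispatch is only to reach the cheaper kernel.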
💊 CI failures summary and remediations — as of commit 7b18adf (more details on the Dr. CI page):

🕵️ 7 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages:
The following script:

```python
from torch.utils.benchmark import Timer
import timeit

timer = Timer(
    "torch::matmul(m1, m2);",
    setup="at::Tensor m1 = torch::zeros({1, 1, 1}); at::Tensor m2 = torch::zeros({1, 1, 1});",
    language="c++",
    timer=timeit.default_timer,
)
stats = timer.collect_callgrind(number=30, repeats=3)
print(stats[1].as_standardized().stats(inclusive=False))
```

reports an instruction count for each of the shapes used.
A more thorough instruction benchmark.

Note in particular the ~50% reduction in instructions in the case when the rhs is a batch of matrices and the lhs is a vector. The reduction in the case when we use the out version is likely higher, but I did not benchmark that branch, as it's not used nearly as much as the main branch. This PR implements a number of general principles to try to save on unnecessary computations:
This is ready for review, @ngimel.
ngimel left a comment:
Sorry for the delay with the review — this looks great. I have small questions about testing.
```cpp
  const Tensor output = at::_unsafe_view(at::mm_out(*out_opt, t1, tensor2), output_size);
  return out_opt->set_(output);
} else {
  const Tensor output = at::_unsafe_view(at::native::mv_out(t1, tensor2, *out_opt), output_size);
```
Ugh, `native::mv_out` here and `at::mm_out` a couple of lines above are pretty confusing (not to mention their different order of the out argument).
I know... There is no `native::mm_out`, so I could not call it directly...
test/test_linalg.py
Outdated
```python
assertEqual(ans, expected)

out = torch.zeros(*shape, dtype=torch.int64).to(x.device)
out = torch.empty((0,), dtype=x.dtype, device=x.device)
```
To make the tests more robust, you probably want to allocate an `out` of the correct shape filled with NaNs here, but you got rid of the shape, so I don't know how feasible that is.
Luckily we compute the non-out version just before, so we have the shapes there :)
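The suggestion can be sketched outside of torch: prefill the `out` buffer with NaNs so any element the kernel fails to write survives as NaN and fails the comparison. The buggy `add_out` below is a made-up example, not PyTorch code:

```python
import math

def add_out(a, b, out):
    # Hypothetical buggy out-variant: forgets to write the last element.
    for i in range(len(a) - 1):
        out[i] = a[i] + b[i]
    return out

a, b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]

# Zero-filled out: the missed write leaves a plausible-looking 0.0 behind,
# so a test comparing against [5.0, 7.0, 9.0] only fails by luck of the fill.
out_zeros = add_out(a, b, [0.0] * 3)

# NaN-filled out: the missed write is unmistakable -- NaN fails every check.
out_nans = add_out(a, b, [math.nan] * 3)
assert math.isnan(out_nans[-1])
```

This is why NaN (rather than zero) prefill makes out-variant tests strictly more robust.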
Here's a small PR that only fixes the extra heap allocations for shapes. Hopefully won't get stuck like #64387. Differential Revision: [D33962610](https://our.internmc.facebook.com/intern/diff/D33962610/) [ghstack-poisoned]
@lezcano what's the status on this?

I have to have another look at this one and split it into several smaller PRs to find where the problem is. I'll do that soon™
Summary: Pull Request resolved: #72230 Here's a small PR that only fixes the extra heap allocations for shapes. Hopefully won't get stuck like #64387. ghstack-source-id: 149069527 Test Plan: CI Reviewed By: ngimel Differential Revision: D33962610 fbshipit-source-id: 51e200f5237bdf225bfb2445e1e36bacd0e65e92 (cherry picked from commit 027537f)
@pytorchbot ciflow rerun

This command didn't do anything.

@pytorchbot ciflow rerun -l ciflow/all

This command didn't do anything.
This PR implements the bulk of #64387. Part of the optimisations were already merged in #72230. A number of these optimisations include:

- Make the code `const` correct.
- Create `DimVector`s more efficiently (e.g. prefer `append` over `insert`).
- Access the sizes of the tensors via `sizes().front()` / `sizes().back()` / `sizes().end()[-2]`.
- Do not create intermediate tensors / vectors when it can be avoided.
- Call `reshape` rather than `expect_contiguous` + `view`.

On top of these, it fixes a correctness issue of `matmul_out`, where the `out` parameter was not resized correctly when passed to the backends. This involves removing the use of `set_` from the calling code, as requested by @ezyang, and it accounts for most of the complexity of the code that this PR adds. [ghstack-poisoned]
This PR has been split into a number of PRs to try to isolate the CI errors. See #75193
Stack from ghstack:
A number of these include:
- Always prefer `DimVector` over `std::vector` when handling dimensions.
- Make the code `const` correct.
- Create `DimVector`s more efficiently (e.g. prefer `append` over `insert`).
- Access the sizes of the tensors via `sizes().front()` / `sizes().back()` / `sizes().end()[-2]`.
- Dispatch to `mv` rather than `mm` in the case `(n,) x (n, m)`.
- Call `reshape` rather than `expect_contiguous` + `view`.

On top of this, while fixing the CI and after further discussions, the following features were added:

- `_out` support for `mv`, `matmul` and `__rmatmul__`.
- Removal of the use of `set_` (requested by @ezyang).
- A fix for `matmul_out`.

Looking into the future, further optimisations could include dispatching to lower-level functions when one of the inputs has shape `(n, 1)` or `(1, n)`.

Fixes #67767
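As a sketch of why the `(n, 1)` / `(1, n)` cases admit a cheaper path: an `(n, 1) @ (1, m)` product is just an outer product, so a specialised kernel could skip the general reduction loop entirely. Plain Python below, with a made-up `mm` helper standing in for the general kernel:

```python
def mm(a, b):
    # General (p, q) @ (q, r) product, with an inner reduction over q.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

u = [1.0, 2.0, 3.0]
v = [4.0, 5.0]
col = [[x] for x in u]   # shape (3, 1)
row = [v]                # shape (1, 2)

# The specialised path: a single outer product, no reduction at all.
outer = [[ui * vj for vj in v] for ui in u]
assert mm(col, row) == outer  # [[4.0, 5.0], [8.0, 10.0], [12.0, 15.0]]
```

For these shapes the inner dimension is 1, so the general kernel's reduction loop runs once per output element — all of its overhead buys nothing over the direct product.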
cc @VitalyFedyunin @ngimel @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @lezcano @heitorschueroff