Migrate addmm, addbmm and THBlas_gemm to ATen #40927
peterbell10 wants to merge 5 commits into pytorch:master
Conversation
💊 CI failures summary and remediations
As of commit 0f8a3f0 (more details on the Dr. CI page):
❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm:
Sorry, I got confused. I see that you are referring to F.linear. But for >3d shaped tensors, F.linear would still have to reshape input and output. Basically, we don't have for now a simple [B1xB2x...xC] @ [CxZ] + [Z] fused matmul+bias method. matmul works with arbitrary shapes, but doesn't fuse with bias; baddbmm fuses with bias, but only works for 3d tensors.
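For illustration, a minimal ATen C++ sketch of the reshape workaround described above (the function name and shapes are mine, assuming the input has at least one batch dimension; this is not code from the PR):

```cpp
#include <ATen/ATen.h>

// Emulate a fused [B1 x B2 x ... x C] @ [C x Z] + [Z] by flattening the
// batch dimensions, calling the 2d fused addmm, and restoring the shape.
at::Tensor linear_via_addmm(
    const at::Tensor& input,   // [B1, B2, ..., C]
    const at::Tensor& weight,  // [C, Z]
    const at::Tensor& bias) {  // [Z]
  auto out_sizes = input.sizes().vec();
  const int64_t c = out_sizes.back();
  // Collapse all leading (batch) dimensions into a single one.
  at::Tensor flat = input.reshape({-1, c});
  // addmm fuses the bias add with the matrix multiply, but is 2d-only,
  // hence the reshapes around it.
  at::Tensor out = at::addmm(bias, flat, weight);
  out_sizes.back() = weight.size(1);
  return out.reshape(out_sizes);
}
```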
I suggest we leave this enhancement for a follow-up PR.
# TODO: update this once torch.addmm is supported for complex
- if dtype.is_complex:
+ if dtype.is_complex and device != 'cpu':
@@ -0,0 +1,250 @@
#include <ATen/native/CPUBlas.h>
Cross-reference with aten/src/TH/generic/THBlas.cpp
It looks like we got a little extra feature bonus, which is that complex gemm now works on CPU.
namespace cpublas {
namespace {
void normalize_last_dims(
New helper function refactored out of THBlas_(gemm) and related functions
}
}
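For readers cross-referencing with THBlas: a hedged sketch of what a stride-normalizing helper like this plausibly does (reconstructed for illustration, not copied from the PR). When a dimension equals 1, the corresponding leading dimension is never used to step through memory, so it can be rewritten to the smallest value BLAS will accept:

```cpp
#include <cstdint>

enum TransposeType { NoTranspose, Transpose, ConjTranspose };

// Sketch: if m, n or k is 1, the associated leading dimension cannot affect
// the result, but BLAS still validates it; normalizing it here lets more
// callers take the fast BLAS path.
void normalize_last_dims_sketch(
    TransposeType transa, TransposeType transb,
    int64_t m, int64_t n, int64_t k,
    int64_t* lda, int64_t* ldb, int64_t* ldc) {
  if (n == 1) {
    *ldc = m;  // c is m x n column-major; with one column, ldc = m suffices
  }
  if (transa != NoTranspose) {
    if (m == 1) *lda = k;  // a stored k x m, read transposed
  } else if (k == 1) {
    *lda = m;
  }
  if (transb != NoTranspose) {
    if (k == 1) *ldb = n;
  } else if (n == 1) {
    *ldb = k;
  }
}
```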
bool use_blas_gemm(
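By way of context, this function gates whether an input can be handed to an external BLAS at all. A minimal sketch of the checks such a gate plausibly performs (an assumption, not the PR's exact logic): reference BLAS interfaces take 32-bit integers, and each leading dimension must be at least the row count of the stored matrix:

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Sketch, assuming the no-transpose layout: a is m x k, b is k x n,
// c is m x n, all column-major.
bool use_blas_gemm_sketch(
    int64_t m, int64_t n, int64_t k,
    int64_t lda, int64_t ldb, int64_t ldc) {
  auto fits = [](int64_t v) {
    // Reference BLAS takes 'int' arguments; larger values force the fallback.
    return v >= 0 && v <= std::numeric_limits<int>::max();
  };
  return fits(m) && fits(n) && fits(k) &&
      fits(lda) && fits(ldb) && fits(ldc) &&
      lda >= std::max<int64_t>(1, m) &&
      ldb >= std::max<int64_t>(1, k) &&
      ldc >= std::max<int64_t>(1, m);
}
```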
switch (trans) {
  case Transpose: return 't';
  case NoTranspose: return 'n';
  // case ConjTranspose: return 'c';
cc @anjali411 we're probably going to want to expose this at some point
DEFINE_DISPATCH(gemm_stub);

void gemm(
@@ -0,0 +1,197 @@
#include <ATen/Dispatch.h>
#include <ATen/native/CPUBlas.h>
Are you sure you actually want to vectorize these fallbacks? They're so naive I'm not sure they're worth the binary size to compile them with AVX/etc.
Do you know if we have any test coverage for this code?
BFloat16, Half and ints < 64 all unconditionally exercise this code path. Only BFloat16 and Half seem to actually have test coverage though.
Are you sure you actually want to vectorize these fallbacks? They're so naive I'm not sure they're worth the binary size to compile them with AVX/etc.
There's certainly a lot of room for improvement, but at the very least I think a.T @ b and a @ b.T should vectorize very well and be reasonably efficient.
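To see why: in the a.T @ b case both operands advance with unit stride in the reduction loop, so even a naive triple loop is a clean dot product that auto-vectorizes. A hedged illustration (column-major, simplified; a real kernel would also special-case beta == 0 to avoid reading uninitialized output):

```cpp
#include <cstdint>

// c = beta * c + alpha * (a.T @ b), column-major storage.
// a is stored k x m (read transposed), b is k x n, c is m x n.
template <typename scalar_t>
void gemm_transa_sketch(
    int64_t m, int64_t n, int64_t k,
    scalar_t alpha,
    const scalar_t* a, int64_t lda,
    const scalar_t* b, int64_t ldb,
    scalar_t beta,
    scalar_t* c, int64_t ldc) {
  for (int64_t j = 0; j < n; ++j) {
    for (int64_t i = 0; i < m; ++i) {
      scalar_t dot = 0;
      // Both a and b are walked contiguously here, which compilers
      // vectorize well under AVX/etc.
      for (int64_t l = 0; l < k; ++l) {
        dot += a[l + i * lda] * b[l + j * ldb];
      }
      c[i + j * ldc] = beta * c[i + j * ldc] + alpha * dot;
    }
  }
}
```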
template <typename scalar_t>
void gemm_notrans_(
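For contrast with the transposed case, a hedged sketch of what a no-transpose body looks like (a reconstruction, not the PR's code): the inner loop streams down a column of c and a column of a, so the update is an axpy-style, unit-stride operation:

```cpp
#include <cstdint>

// c = beta * c + alpha * (a @ b), column-major storage.
// a is m x k, b is k x n, c is m x n.
template <typename scalar_t>
void gemm_notrans_sketch(
    int64_t m, int64_t n, int64_t k,
    scalar_t alpha,
    const scalar_t* a, int64_t lda,
    const scalar_t* b, int64_t ldb,
    scalar_t beta,
    scalar_t* c, int64_t ldc) {
  // c *= beta. NOTE: the real code routes this through a scale_ helper
  // that special-cases beta == 0; this sketch assumes c is initialized.
  for (int64_t j = 0; j < n; ++j) {
    for (int64_t i = 0; i < m; ++i) {
      c[i + j * ldc] *= beta;
    }
  }
  // c += alpha * (a @ b)
  for (int64_t l = 0; l < k; ++l) {
    for (int64_t j = 0; j < n; ++j) {
      const scalar_t val = alpha * b[l + j * ldb];
      for (int64_t i = 0; i < m; ++i) {
        c[i + j * ldc] += a[i + l * lda] * val;  // unit stride in i
      }
    }
  }
}
```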
namespace {
template <typename scalar_t>
void scale_(int64_t m, int64_t n, scalar_t alpha, scalar_t *a, int64_t lda) {
The renaming of beta to alpha here is confusing (no action necessary)
// c *= beta
scale_(m, n, beta, c, ldc);
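As context for the naming comment above: the helper takes alpha because it just scales a matrix in place, while every call site passes beta. A plausible body (assumed; the zero branch matters because 0 * NaN == NaN, so uninitialized output must be overwritten, not multiplied):

```cpp
#include <cstdint>

template <typename scalar_t>
void scale_sketch(
    int64_t m, int64_t n, scalar_t alpha, scalar_t* a, int64_t lda) {
  if (alpha == scalar_t(1)) {
    return;  // nothing to do
  }
  const bool zero_fill = (alpha == scalar_t(0));
  for (int64_t j = 0; j < n; ++j) {
    for (int64_t i = 0; i < m; ++i) {
      // Assign rather than multiply when alpha == 0 so garbage values
      // (including NaN) in the output buffer cannot leak through.
      a[i + j * lda] = zero_fill ? scalar_t(0) : alpha * a[i + j * lda];
    }
  }
}
```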
// c += alpha * (a @ b.T)
These comments are great, thanks!
    Tensor &result, const Tensor &self, Tensor m1, Tensor m2, Scalar beta, Scalar alpha) {
  TORCH_CHECK(self.dim() == 2, "input must be a matrix");
  TORCH_CHECK(m1.dim() == 2, "m1 must be a matrix");
  TORCH_CHECK(m2.dim() == 2, "m2 must be a matrix");
This is a slight pessimization over the old error checking code, which would tell you what the dimensions of input/m1/m2 were (btw, we should use names consistent with the Python documentation, which are input, mat1 and mat2).
  TORCH_CHECK(
      self.size(0) == m1.size(0) && self.size(1) == m2.size(1),
      "input shape is incompatible with matrix multiplication (",
      m1.size(0), "x", m1.size(1), " and ", m2.size(0), "x", m2.size(1), ")");
This is bad. You need to report self size.
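One way to address both review points, as an illustration only (not the fix that actually landed): report every operand's shape and use the documented argument names:

```cpp
#include <ATen/ATen.h>

// Hypothetical checks; 'mat1'/'mat2' follow the names in the Python docs.
void check_addmm_args(
    const at::Tensor& self, const at::Tensor& mat1, const at::Tensor& mat2) {
  TORCH_CHECK(mat1.dim() == 2, "mat1 must be a matrix, got ", mat1.dim(), "-D tensor");
  TORCH_CHECK(mat2.dim() == 2, "mat2 must be a matrix, got ", mat2.dim(), "-D tensor");
  TORCH_CHECK(
      self.size(0) == mat1.size(0) && self.size(1) == mat2.size(1),
      "input shape ", self.size(0), "x", self.size(1),
      " is incompatible with mat1 @ mat2 (",
      mat1.size(0), "x", mat1.size(1), " and ",
      mat2.size(0), "x", mat2.size(1), ")");
}
```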
Diff appears to have regressed performance in prod, unlanding. I'm asking the reporter for more information.
@peterbell10 In the meantime, if you could run some before-and-after benchmarks on these functions, that may also be helpful in pinning down the regression.
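A minimal before-and-after harness in ATen C++, in the spirit of what's being asked (the shape and iteration counts are arbitrary choices, not numbers from this thread; build once against each commit and compare the output):

```cpp
#include <ATen/ATen.h>
#include <chrono>
#include <cstdio>

int main() {
  at::Tensor a = at::randn({128, 128});
  at::Tensor b = at::randn({128, 128});
  at::Tensor c = at::randn({128, 128});
  for (int i = 0; i < 100; ++i) {
    at::addmm(c, a, b);  // warm-up
  }
  constexpr int kIters = 1000;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i) {
    at::addmm(c, a, b);
  }
  auto end = std::chrono::steady_clock::now();
  std::chrono::duration<double, std::micro> elapsed = end - start;
  std::printf("addmm 128x128x128: %.3f us/iter\n", elapsed.count() / kIters);
  return 0;
}
```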
I've run the operator benchmarks for the affected functions. Some information that would be useful, if possible:
@peterbell10 we've recently merged a new benchmarking utility that would allow you to generate random inputs for your functions to get better coverage of performance. PR #38338 comes with examples of benchmarking "before" and "after" builds by creating separate environments, among other comprehensive examples; please take a look.
Their profile is a little hard to read, but it looks like a 3x end-to-end slowdown. Should be really obvious.
Here are the top five mm/matmul shape sizes on the relevant benchmark:
I've now tried fuzzing the tensor shapes, as well as using the exact shapes given. Neither shows any performance regression on my system. I even tried all the combinations of C- and Fortran-contiguous inputs, neither of which made any significant difference; certainly not a 3x performance drop. I also tried different dtypes, and the only thing I see is a slight performance improvement for the non-BLAS cases.
Yeah, the performance regressions we are seeing on op bench internally (e.g. for 128x128x128) are disastrous; it looks like something goes wrong and we don't use BLAS.
@ngimel identified this as an fbcode-specific problem, and has relanded the diff.
Summary: Resubmit #40927. Closes #24679, closes #24678.
`addbmm` depends on `addmm` so needed to be ported at the same time. I also removed `THTensor_(baddbmm)` which I noticed had already been ported, so was just dead code.
After having already written this code, I had to fix merge conflicts with #40354, which revealed there was already an established place for cpu blas routines in ATen. However, the version there doesn't make use of ATen's AVX dispatching, so I thought I'd wait for comment before migrating this into that style.
Pull Request resolved: #40927
Reviewed By: ezyang
Differential Revision: D22468490
Pulled By: ngimel
fbshipit-source-id: f8a22be3216f67629420939455e31a88af20201d
@peterbell10, our internal runs of op bench flagged regressions on 1x1x1 matmuls (linear_N1_IN1_OUT1_cpu_Eager and matmul_M1_N1_K1_trans_aTrue_trans_bFalse_cpu_Eager), which means that overhead increased after the migration to ATen. Can you please check if you can reproduce these regressions?
Summary: Fixes the overhead reported by ngimel in #40927 (comment).
As it turns out, `Tensor.size(n)` has more overhead than `Tensor.sizes()[n]`. Since addmm does a lot of introspection of the input matrix sizes and strides, this added up to a noticeable (~1 us) constant-time overhead. With this change, a 1x1 matmul takes 2.85 us on my machine, compared to 2.90 us on pytorch 1.5.
Pull Request resolved: #41374
Reviewed By: ailzhang
Differential Revision: D22519924
Pulled By: ngimel
fbshipit-source-id: b29504bee7de79ce42e5e50f91523dde42b073b7
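To illustrate the pattern behind the fix (a sketch with my own variable names): Tensor::size(n) does per-call work such as dimension wrapping on every query, while Tensor::sizes() hands back an IntArrayRef once, after which indexing is a plain array load:

```cpp
#include <ATen/ATen.h>

void shape_introspection(const at::Tensor& self) {
  // Slower in a hot path: each call re-enters size() with its own checks.
  int64_t m_slow = self.size(0);
  int64_t n_slow = self.size(1);

  // Cheaper: fetch the whole shape once, then index it directly.
  at::IntArrayRef sizes = self.sizes();
  int64_t m_fast = sizes[0];
  int64_t n_fast = sizes[1];

  (void)m_slow; (void)n_slow; (void)m_fast; (void)n_fast;  // unused warnings
}
```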
Summary: I noticed that `TensorIteratorDynamicCasting.h` defines a helper meta-function, `CPPTypeToScalarType`, which does exactly the same thing as the `c10::CppTypeToScalarType` meta-function I added in gh-40927. No need for two identical definitions.
Pull Request resolved: #42640
Reviewed By: malfet
Differential Revision: D22969708
Pulled By: ezyang
fbshipit-source-id: 8303c7f4a75ae248f393a4811ae9d2bcacab44ff
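For readers unfamiliar with the meta-function being deduplicated here: it maps a C++ type to the matching runtime ScalarType tag at compile time. A small usage sketch:

```cpp
#include <c10/core/ScalarType.h>
#include <cstdint>

// Compile-time mapping from C++ types to c10::ScalarType values.
static_assert(
    c10::CppTypeToScalarType<float>::value == c10::ScalarType::Float,
    "float maps to ScalarType::Float");
static_assert(
    c10::CppTypeToScalarType<int64_t>::value == c10::ScalarType::Long,
    "int64_t maps to ScalarType::Long");
```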
Summary: Closes pytorch#24679, closes pytorch#24678.
`addbmm` depends on `addmm` so needed to be ported at the same time. I also removed `THTensor_(baddbmm)` which I noticed had already been ported, so was just dead code.
After having already written this code, I had to fix merge conflicts with pytorch#40354, which revealed there was already an established place for cpu blas routines in ATen. However, the version there doesn't make use of ATen's AVX dispatching, so I thought I'd wait for comment before migrating this into that style.
Pull Request resolved: pytorch#40927
Differential Revision: D22418756
Pulled By: ezyang
fbshipit-source-id: 44e7bb5964263d73ae8cc6adc5f6d4e966476ae6
Closes #24679, closes #24678
`addbmm` depends on `addmm` so needed to be ported at the same time. I also removed `THTensor_(baddbmm)`, which I noticed had already been ported, so it was just dead code. After having already written this code, I had to fix merge conflicts with #40354, which revealed there was already an established place for CPU BLAS routines in ATen. However, the version there doesn't make use of ATen's AVX dispatching, so I thought I'd wait for comment before migrating this into that style.