
Avoid copies in matmul #76828

Closed
lezcano wants to merge 34 commits into gh/Lezcano/70/base from gh/Lezcano/70/head

Conversation

@lezcano
Collaborator

@lezcano lezcano commented May 4, 2022

Stack from ghstack (oldest at bottom):

With this PR, matmul folds a bmm into a mm or mv if and only if it
can do so without copying. We add tests to make sure that the
algorithm that detects this is accurate.

For the cases where it was copying before, see #75197 (comment) #75197 (comment) #75197 (comment)

For the approach taken, see #75197 (comment)

Fixes #76702

cc @ngimel @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @lezcano
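The folding condition can be illustrated with a small stride check. This is a pure-Python sketch of the idea only, not the PR's actual logic (which lives in ATen's C++ matmul implementation and handles more cases): a batched operand of shape [b, m, n] can be folded into a [b*m, n] matrix without copying iff flattening its first two dimensions is expressible as a view, i.e. stride(0) == m * stride(1).

```python
def can_fold_batch(shape, strides):
    """Return True iff a [b, m, n] tensor with the given strides can be
    viewed as a [b*m, n] matrix without copying, i.e. the batch and row
    dimensions are laid out contiguously with respect to each other."""
    (b, m, n), (sb, sm, sn) = shape, strides
    # merging dims 0 and 1 into one dimension is a view iff
    # stepping over a whole batch equals stepping over m rows
    return sb == m * sm

# contiguous [4, 3, 5]: strides (15, 5, 1) -> foldable
assert can_fold_batch((4, 3, 5), (15, 5, 1))
# last two dims transposed: shape [4, 5, 3], strides (15, 1, 5) -> not foldable
assert not can_fold_batch((4, 5, 3), (15, 1, 5))
```

In the second case a view is impossible, so folding would force a copy; the point of this PR is that matmul then keeps the bmm instead.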
@facebook-github-bot
Contributor

facebook-github-bot commented May 4, 2022


❌ 8 New Failures

As of commit 40e4c01 (more details on the Dr. CI page):

  • 8/8 failures introduced in this PR

🕵️ 8 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build trunk / linux-bionic-cuda10.2-py3.9-gcc7 / test (default, 2, 4, linux.4xlarge.nvidia.gpu) (1/8)

Step: "Test" (full log | diagnosis details)

2022-08-17T09:56:27.6242760Z     assert_equal(
2022-08-17T09:56:27.6243226Z   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-08-17T09:56:27.6243620Z     raise error_metas[0].to_error(msg)
2022-08-17T09:56:27.6243999Z AssertionError: Tensor-likes are not close!
2022-08-17T09:56:27.6244205Z 
2022-08-17T09:56:27.6244338Z Mismatched elements: 47 / 8192 (0.6%)
2022-08-17T09:56:27.6244682Z Greatest absolute difference: 0.0263671875 at index (10, 0, 27) (up to 0.001 allowed)
2022-08-17T09:56:27.6245098Z Greatest relative difference: 0.2673588578844906 at index (10, 0, 41) (up to 0.001 allowed)
2022-08-17T09:56:27.6245332Z 
2022-08-17T09:56:27.6245478Z ======================================================================
2022-08-17T09:56:27.6245981Z FAIL [0.000s]: test_native_multihead_self_attention_cuda_float32 (__main__.TestMHADeviceTypeCUDA) (use_padding=True, pad_all=False, use_nt=True, need_weights=True, average_attn_weights=False)
2022-08-17T09:56:27.6246613Z ----------------------------------------------------------------------
2022-08-17T09:56:27.6247049Z Traceback (most recent call last):
2022-08-17T09:56:27.6247476Z   File "/var/lib/jenkins/workspace/test/test_native_mha.py", line 274, in test_native_multihead_self_attention
2022-08-17T09:56:27.6247862Z     self._test_multihead_attention_impl(
2022-08-17T09:56:27.6248276Z   File "/var/lib/jenkins/workspace/test/test_native_mha.py", line 252, in _test_multihead_attention_impl
2022-08-17T09:56:27.6248859Z     torch.testing.assert_close(ypt, ynpt, atol=2e-5, rtol=2e-3)
2022-08-17T09:56:27.6249396Z   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_comparison.py", line 1342, in assert_close
2022-08-17T09:56:27.6249767Z     assert_equal(
2022-08-17T09:56:27.6250243Z   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-08-17T09:56:27.6250643Z     raise error_metas[0].to_error(msg)
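For reference, the closeness rule that torch.testing.assert_close applies elementwise (per its documentation) is |actual - expected| <= atol + rtol * |expected|; the mismatched elements reported above are values that violate it at the test's atol=2e-5, rtol=2e-3. A minimal sketch of that check:

```python
def within_tolerance(actual, expected, atol=2e-5, rtol=2e-3):
    # elementwise closeness rule used by torch.testing.assert_close
    return abs(actual - expected) <= atol + rtol * abs(expected)

assert within_tolerance(1.0, 1.0001)    # small drift passes
assert not within_tolerance(1.0, 1.03)  # ~3% error fails at rtol=2e-3
```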

See GitHub Actions build periodic / linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck / test (default, 2, 2, linux.4xlarge.nvidia.gpu) (2/8)

Step: "Test" (full log | diagnosis details)

2022-08-17T11:17:52.7222173Z     msg=msg,
2022-08-17T11:17:52.7222641Z   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-08-17T11:17:52.7223011Z     raise error_metas[0].to_error(msg)
2022-08-17T11:17:52.7223377Z AssertionError: Tensor-likes are not close!
2022-08-17T11:17:52.7223641Z 
2022-08-17T11:17:52.7223781Z Mismatched elements: 47 / 8192 (0.6%)
2022-08-17T11:17:52.7224118Z Greatest absolute difference: 0.0263671875 at index (10, 0, 27) (up to 0.001 allowed)
2022-08-17T11:17:52.7224526Z Greatest relative difference: 0.2673588578844906 at index (10, 0, 41) (up to 0.001 allowed)
2022-08-17T11:17:52.7224755Z 
2022-08-17T11:17:52.7224898Z ======================================================================
2022-08-17T11:17:52.7225417Z FAIL [0.000s]: test_native_multihead_self_attention_cuda_float32 (__main__.TestMHADeviceTypeCUDA) (use_padding=True, pad_all=False, use_nt=True, need_weights=True, average_attn_weights=False)
2022-08-17T11:17:52.7226031Z ----------------------------------------------------------------------
2022-08-17T11:17:52.7226382Z Traceback (most recent call last):
2022-08-17T11:17:52.7226742Z   File "test_native_mha.py", line 282, in test_native_multihead_self_attention
2022-08-17T11:17:52.7227113Z     average_attn_weights=average_attn_weights,
2022-08-17T11:17:52.7227466Z   File "test_native_mha.py", line 252, in _test_multihead_attention_impl
2022-08-17T11:17:52.7227934Z     torch.testing.assert_close(ypt, ynpt, atol=2e-5, rtol=2e-3)
2022-08-17T11:17:52.7228481Z   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_comparison.py", line 1359, in assert_close
2022-08-17T11:17:52.7228823Z     msg=msg,
2022-08-17T11:17:52.7229284Z   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-08-17T11:17:52.7229673Z     raise error_metas[0].to_error(msg)

See GitHub Actions build periodic / win-vs2019-cuda11.7-py3 / test (default, 2, 2, windows.8xlarge.nvidia.gpu) (3/8)

Step: "Test" (full log | diagnosis details)

2022-08-17T11:26:40.3800917Z     assert_equal(
2022-08-17T11:26:40.3801416Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_comparison.py", line 1093, in assert_equal
2022-08-17T11:26:40.3801976Z     raise error_metas[0].to_error(msg)
2022-08-17T11:26:40.3802292Z AssertionError: Tensor-likes are not close!
2022-08-17T11:26:40.3802481Z 
2022-08-17T11:26:40.3802602Z Mismatched elements: 18 / 8192 (0.2%)
2022-08-17T11:26:40.3802952Z Greatest absolute difference: 0.005859375 at index (12, 1, 18) (up to 0.001 allowed)
2022-08-17T11:26:40.3803344Z Greatest relative difference: 0.10690121786197564 at index (12, 1, 43) (up to 0.001 allowed)
2022-08-17T11:26:40.3803570Z 
2022-08-17T11:26:40.3803697Z ======================================================================
2022-08-17T11:26:40.3804229Z FAIL [0.079s]: test_native_multihead_self_attention_cuda_float16 (__main__.TestMHADeviceTypeCUDA) (use_padding=True, pad_all=True, use_nt=False, need_weights=False, average_attn_weights=False)
2022-08-17T11:26:40.3804803Z ----------------------------------------------------------------------
2022-08-17T11:26:40.3805137Z Traceback (most recent call last):
2022-08-17T11:26:40.3805662Z   File "C:\actions-runner\_work\pytorch\pytorch\test\test_native_mha.py", line 274, in test_native_multihead_self_attention
2022-08-17T11:26:40.3806119Z     self._test_multihead_attention_impl(
2022-08-17T11:26:40.3806632Z   File "C:\actions-runner\_work\pytorch\pytorch\test\test_native_mha.py", line 247, in _test_multihead_attention_impl
2022-08-17T11:26:40.3807116Z     torch.testing.assert_close(ypt, ynpt, atol=1e-3, rtol=1e-3)
2022-08-17T11:26:40.3807684Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_comparison.py", line 1342, in assert_close
2022-08-17T11:26:40.3808124Z     assert_equal(
2022-08-17T11:26:40.3808611Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_comparison.py", line 1093, in assert_equal
2022-08-17T11:26:40.3809091Z     raise error_metas[0].to_error(msg)

See GitHub Actions build periodic / linux-bionic-cuda11.6-py3.7-gcc7-debug / test (default, 3, 4, linux.4xlarge.nvidia.gpu) (4/8)

Step: "Test" (full log | diagnosis details)

2022-08-17T10:01:35.2521761Z     assert_equal(
2022-08-17T10:01:35.2522332Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-08-17T10:01:35.2522715Z     raise error_metas[0].to_error(msg)
2022-08-17T10:01:35.2523082Z AssertionError: Tensor-likes are not close!
2022-08-17T10:01:35.2523284Z 
2022-08-17T10:01:35.2523414Z Mismatched elements: 47 / 8192 (0.6%)
2022-08-17T10:01:35.2523743Z Greatest absolute difference: 0.0263671875 at index (10, 0, 27) (up to 0.001 allowed)
2022-08-17T10:01:35.2524143Z Greatest relative difference: 0.2673588578844906 at index (10, 0, 41) (up to 0.001 allowed)
2022-08-17T10:01:35.2524372Z 
2022-08-17T10:01:35.2524510Z ======================================================================
2022-08-17T10:01:35.2525021Z FAIL [0.000s]: test_native_multihead_self_attention_cuda_float32 (__main__.TestMHADeviceTypeCUDA) (use_padding=True, pad_all=False, use_nt=True, need_weights=True, average_attn_weights=False)
2022-08-17T10:01:35.2525625Z ----------------------------------------------------------------------
2022-08-17T10:01:35.2525976Z Traceback (most recent call last):
2022-08-17T10:01:35.2526482Z   File "/var/lib/jenkins/workspace/test/test_native_mha.py", line 274, in test_native_multihead_self_attention
2022-08-17T10:01:35.2526873Z     self._test_multihead_attention_impl(
2022-08-17T10:01:35.2527259Z   File "/var/lib/jenkins/workspace/test/test_native_mha.py", line 252, in _test_multihead_attention_impl
2022-08-17T10:01:35.2527767Z     torch.testing.assert_close(ypt, ynpt, atol=2e-5, rtol=2e-3)
2022-08-17T10:01:35.2528312Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1342, in assert_close
2022-08-17T10:01:35.2528654Z     assert_equal(
2022-08-17T10:01:35.2529125Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-08-17T10:01:35.2529510Z     raise error_metas[0].to_error(msg)

See GitHub Actions build periodic / linux-bionic-cuda11.7-py3.7-gcc7-debug / test (default, 3, 4, linux.4xlarge.nvidia.gpu) (5/8)

Step: "Test" (full log | diagnosis details)

2022-08-17T10:06:39.2042229Z     assert_equal(
2022-08-17T10:06:39.2042699Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-08-17T10:06:39.2043154Z     raise error_metas[0].to_error(msg)
2022-08-17T10:06:39.2043512Z AssertionError: Tensor-likes are not close!
2022-08-17T10:06:39.2043714Z 
2022-08-17T10:06:39.2043843Z Mismatched elements: 47 / 8192 (0.6%)
2022-08-17T10:06:39.2044191Z Greatest absolute difference: 0.0263671875 at index (10, 0, 27) (up to 0.001 allowed)
2022-08-17T10:06:39.2044580Z Greatest relative difference: 0.2673588578844906 at index (10, 0, 41) (up to 0.001 allowed)
2022-08-17T10:06:39.2044808Z 
2022-08-17T10:06:39.2044943Z ======================================================================
2022-08-17T10:06:39.2045453Z FAIL [0.000s]: test_native_multihead_self_attention_cuda_float32 (__main__.TestMHADeviceTypeCUDA) (use_padding=True, pad_all=False, use_nt=True, need_weights=True, average_attn_weights=False)
2022-08-17T10:06:39.2046074Z ----------------------------------------------------------------------
2022-08-17T10:06:39.2046401Z Traceback (most recent call last):
2022-08-17T10:06:39.2046804Z   File "/var/lib/jenkins/workspace/test/test_native_mha.py", line 274, in test_native_multihead_self_attention
2022-08-17T10:06:39.2047204Z     self._test_multihead_attention_impl(
2022-08-17T10:06:39.2047593Z   File "/var/lib/jenkins/workspace/test/test_native_mha.py", line 252, in _test_multihead_attention_impl
2022-08-17T10:06:39.2048095Z     torch.testing.assert_close(ypt, ynpt, atol=2e-5, rtol=2e-3)
2022-08-17T10:06:39.2048643Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1342, in assert_close
2022-08-17T10:06:39.2049006Z     assert_equal(
2022-08-17T10:06:39.2049459Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-08-17T10:06:39.2049843Z     raise error_metas[0].to_error(msg)

See GitHub Actions build trunk / win-vs2019-cuda11.6-py3 / test (default, 4, 5, windows.8xlarge.nvidia.gpu) (6/8)

Step: "Test" (full log | diagnosis details)

2022-08-17T10:30:31.0997582Z     assert_equal(
2022-08-17T10:30:31.0998030Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_comparison.py", line 1093, in assert_equal
2022-08-17T10:30:31.0998579Z     raise error_metas[0].to_error(msg)
2022-08-17T10:30:31.0998990Z AssertionError: Tensor-likes are not close!
2022-08-17T10:30:31.0999180Z 
2022-08-17T10:30:31.0999296Z Mismatched elements: 18 / 8192 (0.2%)
2022-08-17T10:30:31.0999711Z Greatest absolute difference: 0.005859375 at index (12, 1, 18) (up to 0.001 allowed)
2022-08-17T10:30:31.1000116Z Greatest relative difference: 0.10690121786197564 at index (12, 1, 43) (up to 0.001 allowed)
2022-08-17T10:30:31.1000329Z 
2022-08-17T10:30:31.1001228Z ======================================================================
2022-08-17T10:30:31.1001788Z FAIL [0.088s]: test_native_multihead_self_attention_cuda_float16 (__main__.TestMHADeviceTypeCUDA) (use_padding=True, pad_all=True, use_nt=False, need_weights=False, average_attn_weights=False)
2022-08-17T10:30:31.1002456Z ----------------------------------------------------------------------
2022-08-17T10:30:31.1002782Z Traceback (most recent call last):
2022-08-17T10:30:31.1003378Z   File "C:\actions-runner\_work\pytorch\pytorch\test\test_native_mha.py", line 274, in test_native_multihead_self_attention
2022-08-17T10:30:31.1003807Z     self._test_multihead_attention_impl(
2022-08-17T10:30:31.1004637Z   File "C:\actions-runner\_work\pytorch\pytorch\test\test_native_mha.py", line 247, in _test_multihead_attention_impl
2022-08-17T10:30:31.1005098Z     torch.testing.assert_close(ypt, ynpt, atol=1e-3, rtol=1e-3)
2022-08-17T10:30:31.1005735Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_comparison.py", line 1342, in assert_close
2022-08-17T10:30:31.1006202Z     assert_equal(
2022-08-17T10:30:31.1006760Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_comparison.py", line 1093, in assert_equal
2022-08-17T10:30:31.1007194Z     raise error_metas[0].to_error(msg)

See GitHub Actions build periodic / linux-bionic-cuda11.6-py3.7-gcc7-debug / test (default, 2, 4, linux.4xlarge.nvidia.gpu) (7/8)

Step: "Test" (full log | diagnosis details)

2022-08-17T10:20:29.2621340Z real	77m31.696s
2022-08-17T10:20:29.2621708Z user	93m35.916s
2022-08-17T10:20:29.2621969Z sys	6m36.407s
2022-08-17T10:20:29.2622234Z + assert_git_not_dirty
2022-08-17T10:20:29.2623273Z + [[ linux-bionic-cuda11.6-py3.7-gcc7-debug != *rocm* ]]
2022-08-17T10:20:29.2624008Z + [[ linux-bionic-cuda11.6-py3.7-gcc7-debug != *xla* ]]
2022-08-17T10:20:29.2624878Z ++ git status --porcelain
2022-08-17T10:20:30.4952414Z + git_status=' M aten/src/ATen/native/LinearAlgebra.cpp'
2022-08-17T10:20:30.4953330Z + [[ -n  M aten/src/ATen/native/LinearAlgebra.cpp ]]
2022-08-17T10:20:30.4953876Z + echo 'Build left local git repository checkout dirty'
2022-08-17T10:20:30.4954265Z Build left local git repository checkout dirty
2022-08-17T10:20:30.4954649Z + echo 'git status --porcelain:'
2022-08-17T10:20:30.4954965Z git status --porcelain:
2022-08-17T10:20:30.4955354Z + echo ' M aten/src/ATen/native/LinearAlgebra.cpp'
2022-08-17T10:20:30.4955713Z  M aten/src/ATen/native/LinearAlgebra.cpp
2022-08-17T10:20:30.4955979Z + exit 1
2022-08-17T10:20:30.5019380Z ##[error]Process completed with exit code 1.
2022-08-17T10:20:30.5068781Z Prepare all required actions
2022-08-17T10:20:30.5069164Z Getting action download info
2022-08-17T10:20:30.7417329Z ##[group]Run ./.github/actions/get-workflow-job-id
2022-08-17T10:20:30.7417655Z with:

See GitHub Actions build pull / linux-bionic-cuda11.6-py3.10-gcc7 / test (default, 2, 4, linux.4xlarge.nvidia.gpu) (8/8)

Step: "Test" (full log | diagnosis details)

2022-08-17T10:17:10.0125776Z     assert_equal(
2022-08-17T10:17:10.0126260Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-08-17T10:17:10.0126614Z     raise error_metas[0].to_error(msg)
2022-08-17T10:17:10.0126976Z AssertionError: Tensor-likes are not close!
2022-08-17T10:17:10.0127556Z 
2022-08-17T10:17:10.0127738Z Mismatched elements: 47 / 8192 (0.6%)
2022-08-17T10:17:10.0128089Z Greatest absolute difference: 0.0263671875 at index (10, 0, 27) (up to 0.001 allowed)
2022-08-17T10:17:10.0128466Z Greatest relative difference: 0.2673588578844906 at index (10, 0, 41) (up to 0.001 allowed)
2022-08-17T10:17:10.0128690Z 
2022-08-17T10:17:10.0128825Z ======================================================================
2022-08-17T10:17:10.0129454Z FAIL [0.073s]: test_native_multihead_self_attention_cuda_float32 (__main__.TestMHADeviceTypeCUDA) (use_padding=True, pad_all=False, use_nt=True, need_weights=True, average_attn_weights=False)
2022-08-17T10:17:10.0130083Z ----------------------------------------------------------------------
2022-08-17T10:17:10.0130407Z Traceback (most recent call last):
2022-08-17T10:17:10.0130806Z   File "/var/lib/jenkins/workspace/test/test_native_mha.py", line 274, in test_native_multihead_self_attention
2022-08-17T10:17:10.0131193Z     self._test_multihead_attention_impl(
2022-08-17T10:17:10.0131572Z   File "/var/lib/jenkins/workspace/test/test_native_mha.py", line 252, in _test_multihead_attention_impl
2022-08-17T10:17:10.0132072Z     torch.testing.assert_close(ypt, ynpt, atol=2e-5, rtol=2e-3)
2022-08-17T10:17:10.0132612Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1342, in assert_close
2022-08-17T10:17:10.0132967Z     assert_equal(
2022-08-17T10:17:10.0133420Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-08-17T10:17:10.0134211Z     raise error_metas[0].to_error(msg)

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


lezcano added a commit that referenced this pull request May 4, 2022

ghstack-source-id: c42f61a
Pull Request resolved: #76828
@lezcano lezcano added the module: performance, module: linear algebra, ciflow/all, and topic: performance labels May 4, 2022
@lezcano lezcano requested review from ngimel and removed request for IvanYashchuk and nikitaved May 4, 2022 18:13
lezcano added a commit that referenced this pull request May 5, 2022

ghstack-source-id: d78c6bb
Pull Request resolved: #76828
lezcano added a commit that referenced this pull request May 10, 2022

ghstack-source-id: 7593983
Pull Request resolved: #76828
lezcano added a commit that referenced this pull request Feb 22, 2023
… the backward of matmul"


Currently, if we multiply a transposed batch of matrices of shape
[b, m, n] by a matrix of shape [n, k], then when computing the gradient
of the matrix we instantiate a tensor of shape [b, n, k]. This may be
a very large tensor. Instead, we fold the batch of matrices into a
single matrix, which avoids creating any large intermediate tensor.

Note that multiplying a batch of matrices by a matrix occurs naturally
within an attention module, so this case surely happens in the wild.
In particular, this issue was found while investigating the OOMs caused by the
improved folding algorithm in the next PR of this stack. See #76828 (comment).
This PR fixes those OOMs and decreases the memory footprint of the
backward of matmul.

I understand this is a tricky one, so I put it on its own PR to discuss it.

[ghstack-poisoned]
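The identity behind this gradient computation can be checked numerically. This is an illustrative NumPy sketch under assumed shapes, not the actual ATen/autograd code: summing the per-batch terms A[i].T @ grad_out[i] equals a single mm on the folded operands, so no [b, n, k] intermediate ever needs to be materialized.

```python
import numpy as np

# For out = A @ B with A of shape [b, m, n] and B of shape [n, k],
# grad_B = sum_i A[i].T @ grad_out[i]. That sum is exactly one matrix
# multiply if the batch dimension is folded into the rows:
#   A.reshape(b*m, n).T @ grad_out.reshape(b*m, k)
rng = np.random.default_rng(0)
b, m, n, k = 4, 3, 5, 2
A = rng.standard_normal((b, m, n))
grad_out = rng.standard_normal((b, m, k))  # gradient of the [b, m, k] output

naive = sum(A[i].T @ grad_out[i] for i in range(b))   # per-batch terms
folded = A.reshape(b * m, n).T @ grad_out.reshape(b * m, k)  # one mm

assert np.allclose(naive, folded)
```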
lezcano added a commit that referenced this pull request Feb 22, 2023
lezcano added a commit that referenced this pull request Feb 22, 2023

ghstack-source-id: 835ab33
Pull Request resolved: #76828
lezcano added a commit that referenced this pull request Feb 22, 2023
lezcano added a commit that referenced this pull request Feb 22, 2023
lezcano added a commit that referenced this pull request Feb 22, 2023

ghstack-source-id: 499c931
Pull Request resolved: #76828
lezcano added a commit that referenced this pull request Feb 23, 2023
lezcano added a commit that referenced this pull request Feb 23, 2023
lezcano added a commit that referenced this pull request Feb 23, 2023

ghstack-source-id: 60605e0
Pull Request resolved: #76828
@lezcano lezcano requested a review from ezyang as a code owner February 26, 2023 16:29
@lezcano
Collaborator Author

lezcano commented Feb 26, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


@pytorchmergebot
Collaborator

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator


@lezcano
Collaborator Author

lezcano commented Feb 26, 2023

@pytorchbot merge -f "unrelated errors"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


@pytorchmergebot
Collaborator

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator


@lezcano
Collaborator Author

lezcano commented Feb 27, 2023

@ngimel @ezyang could you help land this one?

@lezcano
Collaborator Author

lezcano commented Feb 27, 2023

@pytorchbot merge -f "unrelated errors"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


@facebook-github-bot
Contributor

This pull request has been reverted by 8c8148c. To re-land this change, please open another pull request, assign the same reviewers, fix the CI failures that caused the revert, and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).

@lezcano
Collaborator Author

lezcano commented May 10, 2023

Merged in #97355


Labels

  • ciflow/periodic: Trigger jobs ran periodically on master (periodic.yml) on the PR
  • ciflow/trunk: Trigger trunk jobs on your pull request
  • cla signed
  • Merged
  • module: linear algebra: Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply matmul
  • module: performance: Issues related to performance, either of kernel code or framework glue
  • open source
  • Reverted
  • topic: performance: topic category
