
Avoid copies in matmul #76828

Closed
lezcano wants to merge 34 commits into gh/Lezcano/70/base from gh/Lezcano/70/head

Conversation

@lezcano
Collaborator

@lezcano lezcano commented May 4, 2022

Stack from ghstack (oldest at bottom):

With this PR, matmul folds a bmm into a mm or mv if and only if it
can do so without copying. We add tests to make sure that the
algorithm that detects this is accurate.

For the cases where it was copying before, see #75197 (comment) #75197 (comment) #75197 (comment)

For the approach taken, see #75197 (comment)

Fixes #76702

cc @ngimel @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @lezcano
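The folding condition can be illustrated with a small stride check. This is a pure-Python sketch of the idea only, not the PR's actual logic (which lives in ATen's C++ matmul implementation and handles more cases): a batched operand of shape [b, m, n] can be folded into a [b*m, n] matrix without copying iff flattening its first two dimensions is expressible as a view, i.e. stride(0) == m * stride(1).

```python
def can_fold_batch(shape, strides):
    """Return True iff a [b, m, n] tensor with the given strides can be
    viewed as a [b*m, n] matrix without copying, i.e. the batch and row
    dimensions are laid out contiguously with respect to each other."""
    (b, m, n), (sb, sm, sn) = shape, strides
    # merging dims 0 and 1 into one dimension is a view iff
    # stepping over a whole batch equals stepping over m rows
    return sb == m * sm

# contiguous [4, 3, 5]: strides (15, 5, 1) -> foldable
assert can_fold_batch((4, 3, 5), (15, 5, 1))
# last two dims transposed: shape [4, 5, 3], strides (15, 1, 5) -> not foldable
assert not can_fold_batch((4, 5, 3), (15, 1, 5))
```

In the second case a view is impossible, so folding would force a copy; the point of this PR is that matmul then keeps the bmm instead.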
@facebook-github-bot
Contributor

facebook-github-bot commented May 4, 2022


❌ 8 New Failures

As of commit 40e4c01 (more details on the Dr. CI page):

  • 8/8 failures introduced in this PR

🕵️ 8 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build trunk / linux-bionic-cuda10.2-py3.9-gcc7 / test (default, 2, 4, linux.4xlarge.nvidia.gpu) (1/8)

Step: "Test" (full log | diagnosis details)

2022-08-17T09:56:27.6242760Z     assert_equal(
2022-08-17T09:56:27.6243226Z   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-08-17T09:56:27.6243620Z     raise error_metas[0].to_error(msg)
2022-08-17T09:56:27.6243999Z AssertionError: Tensor-likes are not close!
2022-08-17T09:56:27.6244205Z 
2022-08-17T09:56:27.6244338Z Mismatched elements: 47 / 8192 (0.6%)
2022-08-17T09:56:27.6244682Z Greatest absolute difference: 0.0263671875 at index (10, 0, 27) (up to 0.001 allowed)
2022-08-17T09:56:27.6245098Z Greatest relative difference: 0.2673588578844906 at index (10, 0, 41) (up to 0.001 allowed)
2022-08-17T09:56:27.6245332Z 
2022-08-17T09:56:27.6245478Z ======================================================================
2022-08-17T09:56:27.6245981Z FAIL [0.000s]: test_native_multihead_self_attention_cuda_float32 (__main__.TestMHADeviceTypeCUDA) (use_padding=True, pad_all=False, use_nt=True, need_weights=True, average_attn_weights=False)
2022-08-17T09:56:27.6246613Z ----------------------------------------------------------------------
2022-08-17T09:56:27.6247049Z Traceback (most recent call last):
2022-08-17T09:56:27.6247476Z   File "/var/lib/jenkins/workspace/test/test_native_mha.py", line 274, in test_native_multihead_self_attention
2022-08-17T09:56:27.6247862Z     self._test_multihead_attention_impl(
2022-08-17T09:56:27.6248276Z   File "/var/lib/jenkins/workspace/test/test_native_mha.py", line 252, in _test_multihead_attention_impl
2022-08-17T09:56:27.6248859Z     torch.testing.assert_close(ypt, ynpt, atol=2e-5, rtol=2e-3)
2022-08-17T09:56:27.6249396Z   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_comparison.py", line 1342, in assert_close
2022-08-17T09:56:27.6249767Z     assert_equal(
2022-08-17T09:56:27.6250243Z   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-08-17T09:56:27.6250643Z     raise error_metas[0].to_error(msg)
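For reference, the closeness rule that torch.testing.assert_close applies elementwise (per its documentation) is |actual - expected| <= atol + rtol * |expected|; the mismatched elements reported above are values that violate it at the test's atol=2e-5, rtol=2e-3. A minimal sketch of that check:

```python
def within_tolerance(actual, expected, atol=2e-5, rtol=2e-3):
    # elementwise closeness rule used by torch.testing.assert_close
    return abs(actual - expected) <= atol + rtol * abs(expected)

assert within_tolerance(1.0, 1.0001)    # small drift passes
assert not within_tolerance(1.0, 1.03)  # ~3% error fails at rtol=2e-3
```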

See GitHub Actions build periodic / linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck / test (default, 2, 2, linux.4xlarge.nvidia.gpu) (2/8)

Step: "Test" (full log | diagnosis details)

2022-08-17T11:17:52.7222173Z     msg=msg,
2022-08-17T11:17:52.7222641Z   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-08-17T11:17:52.7223011Z     raise error_metas[0].to_error(msg)
2022-08-17T11:17:52.7223377Z AssertionError: Tensor-likes are not close!
2022-08-17T11:17:52.7223641Z 
2022-08-17T11:17:52.7223781Z Mismatched elements: 47 / 8192 (0.6%)
2022-08-17T11:17:52.7224118Z Greatest absolute difference: 0.0263671875 at index (10, 0, 27) (up to 0.001 allowed)
2022-08-17T11:17:52.7224526Z Greatest relative difference: 0.2673588578844906 at index (10, 0, 41) (up to 0.001 allowed)
2022-08-17T11:17:52.7224755Z 
2022-08-17T11:17:52.7224898Z ======================================================================
2022-08-17T11:17:52.7225417Z FAIL [0.000s]: test_native_multihead_self_attention_cuda_float32 (__main__.TestMHADeviceTypeCUDA) (use_padding=True, pad_all=False, use_nt=True, need_weights=True, average_attn_weights=False)
2022-08-17T11:17:52.7226031Z ----------------------------------------------------------------------
2022-08-17T11:17:52.7226382Z Traceback (most recent call last):
2022-08-17T11:17:52.7226742Z   File "test_native_mha.py", line 282, in test_native_multihead_self_attention
2022-08-17T11:17:52.7227113Z     average_attn_weights=average_attn_weights,
2022-08-17T11:17:52.7227466Z   File "test_native_mha.py", line 252, in _test_multihead_attention_impl
2022-08-17T11:17:52.7227934Z     torch.testing.assert_close(ypt, ynpt, atol=2e-5, rtol=2e-3)
2022-08-17T11:17:52.7228481Z   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_comparison.py", line 1359, in assert_close
2022-08-17T11:17:52.7228823Z     msg=msg,
2022-08-17T11:17:52.7229284Z   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-08-17T11:17:52.7229673Z     raise error_metas[0].to_error(msg)

See GitHub Actions build periodic / win-vs2019-cuda11.7-py3 / test (default, 2, 2, windows.8xlarge.nvidia.gpu) (3/8)

Step: "Test" (full log | diagnosis details)

2022-08-17T11:26:40.3800917Z     assert_equal(
2022-08-17T11:26:40.3801416Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_comparison.py", line 1093, in assert_equal
2022-08-17T11:26:40.3801976Z     raise error_metas[0].to_error(msg)
2022-08-17T11:26:40.3802292Z AssertionError: Tensor-likes are not close!
2022-08-17T11:26:40.3802481Z 
2022-08-17T11:26:40.3802602Z Mismatched elements: 18 / 8192 (0.2%)
2022-08-17T11:26:40.3802952Z Greatest absolute difference: 0.005859375 at index (12, 1, 18) (up to 0.001 allowed)
2022-08-17T11:26:40.3803344Z Greatest relative difference: 0.10690121786197564 at index (12, 1, 43) (up to 0.001 allowed)
2022-08-17T11:26:40.3803570Z 
2022-08-17T11:26:40.3803697Z ======================================================================
2022-08-17T11:26:40.3804229Z FAIL [0.079s]: test_native_multihead_self_attention_cuda_float16 (__main__.TestMHADeviceTypeCUDA) (use_padding=True, pad_all=True, use_nt=False, need_weights=False, average_attn_weights=False)
2022-08-17T11:26:40.3804803Z ----------------------------------------------------------------------
2022-08-17T11:26:40.3805137Z Traceback (most recent call last):
2022-08-17T11:26:40.3805662Z   File "C:\actions-runner\_work\pytorch\pytorch\test\test_native_mha.py", line 274, in test_native_multihead_self_attention
2022-08-17T11:26:40.3806119Z     self._test_multihead_attention_impl(
2022-08-17T11:26:40.3806632Z   File "C:\actions-runner\_work\pytorch\pytorch\test\test_native_mha.py", line 247, in _test_multihead_attention_impl
2022-08-17T11:26:40.3807116Z     torch.testing.assert_close(ypt, ynpt, atol=1e-3, rtol=1e-3)
2022-08-17T11:26:40.3807684Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_comparison.py", line 1342, in assert_close
2022-08-17T11:26:40.3808124Z     assert_equal(
2022-08-17T11:26:40.3808611Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_comparison.py", line 1093, in assert_equal
2022-08-17T11:26:40.3809091Z     raise error_metas[0].to_error(msg)

See GitHub Actions build periodic / linux-bionic-cuda11.6-py3.7-gcc7-debug / test (default, 3, 4, linux.4xlarge.nvidia.gpu) (4/8)

Step: "Test" (full log | diagnosis details)

2022-08-17T10:01:35.2521761Z     assert_equal(
2022-08-17T10:01:35.2522332Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-08-17T10:01:35.2522715Z     raise error_metas[0].to_error(msg)
2022-08-17T10:01:35.2523082Z AssertionError: Tensor-likes are not close!
2022-08-17T10:01:35.2523284Z 
2022-08-17T10:01:35.2523414Z Mismatched elements: 47 / 8192 (0.6%)
2022-08-17T10:01:35.2523743Z Greatest absolute difference: 0.0263671875 at index (10, 0, 27) (up to 0.001 allowed)
2022-08-17T10:01:35.2524143Z Greatest relative difference: 0.2673588578844906 at index (10, 0, 41) (up to 0.001 allowed)
2022-08-17T10:01:35.2524372Z 
2022-08-17T10:01:35.2524510Z ======================================================================
2022-08-17T10:01:35.2525021Z FAIL [0.000s]: test_native_multihead_self_attention_cuda_float32 (__main__.TestMHADeviceTypeCUDA) (use_padding=True, pad_all=False, use_nt=True, need_weights=True, average_attn_weights=False)
2022-08-17T10:01:35.2525625Z ----------------------------------------------------------------------
2022-08-17T10:01:35.2525976Z Traceback (most recent call last):
2022-08-17T10:01:35.2526482Z   File "/var/lib/jenkins/workspace/test/test_native_mha.py", line 274, in test_native_multihead_self_attention
2022-08-17T10:01:35.2526873Z     self._test_multihead_attention_impl(
2022-08-17T10:01:35.2527259Z   File "/var/lib/jenkins/workspace/test/test_native_mha.py", line 252, in _test_multihead_attention_impl
2022-08-17T10:01:35.2527767Z     torch.testing.assert_close(ypt, ynpt, atol=2e-5, rtol=2e-3)
2022-08-17T10:01:35.2528312Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1342, in assert_close
2022-08-17T10:01:35.2528654Z     assert_equal(
2022-08-17T10:01:35.2529125Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-08-17T10:01:35.2529510Z     raise error_metas[0].to_error(msg)

See GitHub Actions build periodic / linux-bionic-cuda11.7-py3.7-gcc7-debug / test (default, 3, 4, linux.4xlarge.nvidia.gpu) (5/8)

Step: "Test" (full log | diagnosis details)

2022-08-17T10:06:39.2042229Z     assert_equal(
2022-08-17T10:06:39.2042699Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-08-17T10:06:39.2043154Z     raise error_metas[0].to_error(msg)
2022-08-17T10:06:39.2043512Z AssertionError: Tensor-likes are not close!
2022-08-17T10:06:39.2043714Z 
2022-08-17T10:06:39.2043843Z Mismatched elements: 47 / 8192 (0.6%)
2022-08-17T10:06:39.2044191Z Greatest absolute difference: 0.0263671875 at index (10, 0, 27) (up to 0.001 allowed)
2022-08-17T10:06:39.2044580Z Greatest relative difference: 0.2673588578844906 at index (10, 0, 41) (up to 0.001 allowed)
2022-08-17T10:06:39.2044808Z 
2022-08-17T10:06:39.2044943Z ======================================================================
2022-08-17T10:06:39.2045453Z FAIL [0.000s]: test_native_multihead_self_attention_cuda_float32 (__main__.TestMHADeviceTypeCUDA) (use_padding=True, pad_all=False, use_nt=True, need_weights=True, average_attn_weights=False)
2022-08-17T10:06:39.2046074Z ----------------------------------------------------------------------
2022-08-17T10:06:39.2046401Z Traceback (most recent call last):
2022-08-17T10:06:39.2046804Z   File "/var/lib/jenkins/workspace/test/test_native_mha.py", line 274, in test_native_multihead_self_attention
2022-08-17T10:06:39.2047204Z     self._test_multihead_attention_impl(
2022-08-17T10:06:39.2047593Z   File "/var/lib/jenkins/workspace/test/test_native_mha.py", line 252, in _test_multihead_attention_impl
2022-08-17T10:06:39.2048095Z     torch.testing.assert_close(ypt, ynpt, atol=2e-5, rtol=2e-3)
2022-08-17T10:06:39.2048643Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1342, in assert_close
2022-08-17T10:06:39.2049006Z     assert_equal(
2022-08-17T10:06:39.2049459Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-08-17T10:06:39.2049843Z     raise error_metas[0].to_error(msg)

See GitHub Actions build trunk / win-vs2019-cuda11.6-py3 / test (default, 4, 5, windows.8xlarge.nvidia.gpu) (6/8)

Step: "Test" (full log | diagnosis details)

2022-08-17T10:30:31.0997582Z     assert_equal(
2022-08-17T10:30:31.0998030Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_comparison.py", line 1093, in assert_equal
2022-08-17T10:30:31.0998579Z     raise error_metas[0].to_error(msg)
2022-08-17T10:30:31.0998990Z AssertionError: Tensor-likes are not close!
2022-08-17T10:30:31.0999180Z 
2022-08-17T10:30:31.0999296Z Mismatched elements: 18 / 8192 (0.2%)
2022-08-17T10:30:31.0999711Z Greatest absolute difference: 0.005859375 at index (12, 1, 18) (up to 0.001 allowed)
2022-08-17T10:30:31.1000116Z Greatest relative difference: 0.10690121786197564 at index (12, 1, 43) (up to 0.001 allowed)
2022-08-17T10:30:31.1000329Z 
2022-08-17T10:30:31.1001228Z ======================================================================
2022-08-17T10:30:31.1001788Z FAIL [0.088s]: test_native_multihead_self_attention_cuda_float16 (__main__.TestMHADeviceTypeCUDA) (use_padding=True, pad_all=True, use_nt=False, need_weights=False, average_attn_weights=False)
2022-08-17T10:30:31.1002456Z ----------------------------------------------------------------------
2022-08-17T10:30:31.1002782Z Traceback (most recent call last):
2022-08-17T10:30:31.1003378Z   File "C:\actions-runner\_work\pytorch\pytorch\test\test_native_mha.py", line 274, in test_native_multihead_self_attention
2022-08-17T10:30:31.1003807Z     self._test_multihead_attention_impl(
2022-08-17T10:30:31.1004637Z   File "C:\actions-runner\_work\pytorch\pytorch\test\test_native_mha.py", line 247, in _test_multihead_attention_impl
2022-08-17T10:30:31.1005098Z     torch.testing.assert_close(ypt, ynpt, atol=1e-3, rtol=1e-3)
2022-08-17T10:30:31.1005735Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_comparison.py", line 1342, in assert_close
2022-08-17T10:30:31.1006202Z     assert_equal(
2022-08-17T10:30:31.1006760Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_comparison.py", line 1093, in assert_equal
2022-08-17T10:30:31.1007194Z     raise error_metas[0].to_error(msg)

See GitHub Actions build periodic / linux-bionic-cuda11.6-py3.7-gcc7-debug / test (default, 2, 4, linux.4xlarge.nvidia.gpu) (7/8)

Step: "Test" (full log | diagnosis details)

2022-08-17T10:20:29.2621340Z real	77m31.696s
2022-08-17T10:20:29.2621708Z user	93m35.916s
2022-08-17T10:20:29.2621969Z sys	6m36.407s
2022-08-17T10:20:29.2622234Z + assert_git_not_dirty
2022-08-17T10:20:29.2623273Z + [[ linux-bionic-cuda11.6-py3.7-gcc7-debug != *rocm* ]]
2022-08-17T10:20:29.2624008Z + [[ linux-bionic-cuda11.6-py3.7-gcc7-debug != *xla* ]]
2022-08-17T10:20:29.2624878Z ++ git status --porcelain
2022-08-17T10:20:30.4952414Z + git_status=' M aten/src/ATen/native/LinearAlgebra.cpp'
2022-08-17T10:20:30.4953330Z + [[ -n  M aten/src/ATen/native/LinearAlgebra.cpp ]]
2022-08-17T10:20:30.4953876Z + echo 'Build left local git repository checkout dirty'
2022-08-17T10:20:30.4954265Z Build left local git repository checkout dirty
2022-08-17T10:20:30.4954649Z + echo 'git status --porcelain:'
2022-08-17T10:20:30.4954965Z git status --porcelain:
2022-08-17T10:20:30.4955354Z + echo ' M aten/src/ATen/native/LinearAlgebra.cpp'
2022-08-17T10:20:30.4955713Z  M aten/src/ATen/native/LinearAlgebra.cpp
2022-08-17T10:20:30.4955979Z + exit 1
2022-08-17T10:20:30.5019380Z ##[error]Process completed with exit code 1.
2022-08-17T10:20:30.5068781Z Prepare all required actions
2022-08-17T10:20:30.5069164Z Getting action download info
2022-08-17T10:20:30.7417329Z ##[group]Run ./.github/actions/get-workflow-job-id
2022-08-17T10:20:30.7417655Z with:

See GitHub Actions build pull / linux-bionic-cuda11.6-py3.10-gcc7 / test (default, 2, 4, linux.4xlarge.nvidia.gpu) (8/8)

Step: "Test" (full log | diagnosis details)

2022-08-17T10:17:10.0125776Z     assert_equal(
2022-08-17T10:17:10.0126260Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-08-17T10:17:10.0126614Z     raise error_metas[0].to_error(msg)
2022-08-17T10:17:10.0126976Z AssertionError: Tensor-likes are not close!
2022-08-17T10:17:10.0127556Z 
2022-08-17T10:17:10.0127738Z Mismatched elements: 47 / 8192 (0.6%)
2022-08-17T10:17:10.0128089Z Greatest absolute difference: 0.0263671875 at index (10, 0, 27) (up to 0.001 allowed)
2022-08-17T10:17:10.0128466Z Greatest relative difference: 0.2673588578844906 at index (10, 0, 41) (up to 0.001 allowed)
2022-08-17T10:17:10.0128690Z 
2022-08-17T10:17:10.0128825Z ======================================================================
2022-08-17T10:17:10.0129454Z FAIL [0.073s]: test_native_multihead_self_attention_cuda_float32 (__main__.TestMHADeviceTypeCUDA) (use_padding=True, pad_all=False, use_nt=True, need_weights=True, average_attn_weights=False)
2022-08-17T10:17:10.0130083Z ----------------------------------------------------------------------
2022-08-17T10:17:10.0130407Z Traceback (most recent call last):
2022-08-17T10:17:10.0130806Z   File "/var/lib/jenkins/workspace/test/test_native_mha.py", line 274, in test_native_multihead_self_attention
2022-08-17T10:17:10.0131193Z     self._test_multihead_attention_impl(
2022-08-17T10:17:10.0131572Z   File "/var/lib/jenkins/workspace/test/test_native_mha.py", line 252, in _test_multihead_attention_impl
2022-08-17T10:17:10.0132072Z     torch.testing.assert_close(ypt, ynpt, atol=2e-5, rtol=2e-3)
2022-08-17T10:17:10.0132612Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1342, in assert_close
2022-08-17T10:17:10.0132967Z     assert_equal(
2022-08-17T10:17:10.0133420Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-08-17T10:17:10.0134211Z     raise error_metas[0].to_error(msg)

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


lezcano added a commit that referenced this pull request May 4, 2022

ghstack-source-id: c42f61a
Pull Request resolved: #76828
@lezcano lezcano added the module: performance, module: linear algebra, ciflow/all, and topic: performance labels May 4, 2022
@lezcano lezcano requested review from ngimel and removed request for IvanYashchuk and nikitaved May 4, 2022 18:13
lezcano added a commit that referenced this pull request May 5, 2022

ghstack-source-id: d78c6bb
Pull Request resolved: #76828
lezcano added a commit that referenced this pull request May 10, 2022

ghstack-source-id: 7593983
Pull Request resolved: #76828
lezcano added a commit that referenced this pull request Feb 22, 2023
… the backward of matmul"


Currently, if we multiply a transposed batch of matrices of shape
[b, m, n] by a matrix of shape [n, k], then when computing the gradient
of the matrix we instantiate a tensor of shape [b, n, k]. This may be
a very large tensor. Instead, we fold the batch of matrices into a
single matrix, which avoids creating any large intermediate tensor.

Note that multiplying a batch of matrices by a matrix occurs naturally
within an attention module, so this case surely happens in the wild.
In particular, this issue was found while investigating the OOMs caused by the
improved folding algorithm in the next PR of this stack. See #76828 (comment).
This PR fixes those OOMs and decreases the memory footprint of the
backward of matmul.

I understand this is a tricky one, so I put it on its own PR to discuss it.

[ghstack-poisoned]
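The identity behind this gradient computation can be checked numerically. This is an illustrative NumPy sketch under assumed shapes, not the actual ATen/autograd code: summing the per-batch terms A[i].T @ grad_out[i] equals a single mm on the folded operands, so no [b, n, k] intermediate ever needs to be materialized.

```python
import numpy as np

# For out = A @ B with A of shape [b, m, n] and B of shape [n, k],
# grad_B = sum_i A[i].T @ grad_out[i]. That sum is exactly one matrix
# multiply if the batch dimension is folded into the rows:
#   A.reshape(b*m, n).T @ grad_out.reshape(b*m, k)
rng = np.random.default_rng(0)
b, m, n, k = 4, 3, 5, 2
A = rng.standard_normal((b, m, n))
grad_out = rng.standard_normal((b, m, k))  # gradient of the [b, m, k] output

naive = sum(A[i].T @ grad_out[i] for i in range(b))   # per-batch terms
folded = A.reshape(b * m, n).T @ grad_out.reshape(b * m, k)  # one mm

assert np.allclose(naive, folded)
```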
lezcano added a commit that referenced this pull request Feb 22, 2023
lezcano added a commit that referenced this pull request Feb 22, 2023

ghstack-source-id: 835ab33
Pull Request resolved: #76828
lezcano added a commit that referenced this pull request Feb 22, 2023
lezcano added a commit that referenced this pull request Feb 22, 2023
lezcano added a commit that referenced this pull request Feb 22, 2023

ghstack-source-id: 499c931
Pull Request resolved: #76828
lezcano added a commit that referenced this pull request Feb 23, 2023
lezcano added a commit that referenced this pull request Feb 23, 2023
lezcano added a commit that referenced this pull request Feb 23, 2023

ghstack-source-id: 60605e0
Pull Request resolved: #76828
@lezcano lezcano requested a review from ezyang as a code owner February 26, 2023 16:29
@lezcano
Collaborator Author

lezcano commented Feb 26, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


@pytorchmergebot
Collaborator

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator


@lezcano
Collaborator Author

lezcano commented Feb 26, 2023

@pytorchbot merge -f "unrelated errors"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


@pytorchmergebot
Collaborator

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator


@lezcano
Collaborator Author

lezcano commented Feb 27, 2023

@ngimel @ezyang could you help land this one?

@lezcano
Collaborator Author

lezcano commented Feb 27, 2023

@pytorchbot merge -f "unrelated errors"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


@facebook-github-bot
Contributor

This pull request has been reverted by 8c8148c. To re-land this change, please open another pull request, assign the same reviewers, fix the CI failures that caused the revert, and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).

@lezcano
Collaborator Author

lezcano commented May 10, 2023

Merged in #97355


Labels

  • ciflow/periodic: Trigger jobs ran periodically on master (periodic.yml) on the PR
  • ciflow/trunk: Trigger trunk jobs on your pull request
  • cla signed
  • Merged
  • module: linear algebra: Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply matmul
  • module: performance: Issues related to performance, either of kernel code or framework glue
  • open source
  • Reverted
  • topic: performance: topic category
