Conversation
With this PR, matmul folds a bmm into a mm or a mv if and only if it can do so without copying. For the cases where it was copying before, see #75197 (comment), #75197 (comment), and #75197 (comment). For the approach taken, see #75197 (comment). Fixes #76702
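The folding condition can be illustrated with a toy stride check. This is a hypothetical sketch, not the actual ATen predicate: a batch of matrices of shape [b, m, n] can be reinterpreted as a single [b*m, n] matrix without copying only when the batch dimension is laid out contiguously on top of the rows, i.e. stride(0) == m * stride(1).

```python
def can_fold_bmm_into_mm(shape, strides):
    """Toy check: can a [b, m, n] strided tensor be viewed as [b*m, n]
    without a copy? (Illustrative only, not PyTorch's real algorithm.)"""
    b, m, n = shape
    sb, sm, sn = strides
    # A size-1 batch dimension imposes no layout constraint.
    return b == 1 or sb == m * sm

# Row-major [2, 3, 4] tensor has strides (12, 4, 1): foldable.
assert can_fold_bmm_into_mm((2, 3, 4), (12, 4, 1))
# The same storage transposed in the last two dims has strides (12, 1, 4):
# stride(0) != m * stride(1), so folding would require a copy.
assert not can_fold_bmm_into_mm((2, 4, 3), (12, 1, 4))
```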
❌ 8 new failures as of commit 40e4c01 (more details on the Dr. CI page).
🕵️ 8 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages.
Currently, if we multiply a transposed batch of matrices of shape [b, m, n] by a matrix of shape [n, k], then when computing the gradient of the matrix we instantiate an intermediate tensor of shape [b, n, k]. This may be a very large tensor. Instead, we fold the batch of matrices into a single matrix, which avoids creating any large intermediate tensor. Note that multiplying a batch of matrices by a matrix occurs naturally inside an attention module, so this case surely happens in the wild. In particular, this issue was found while investigating the OOMs caused by the improved folding algorithm in the next PR of this stack. See #76828 (comment). This PR fixes those OOMs and decreases the memory footprint of the backward of matmul. I understand this is a tricky one, so I put it on its own PR to discuss it.
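The memory saving rests on a simple identity: the gradient of X in out[i] = A[i] @ X is the sum over the batch of A[i].T @ grad_out[i], and that sum equals a single mm over the batch folded into the row dimension. A toy plain-Python sketch of the equivalence (illustrative only, not the actual autograd code):

```python
def mm(a, b):
    """Naive [p, q] @ [q, r] matrix multiply on nested lists."""
    q, r = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(q)) for j in range(r)]
            for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def grad_x_naive(A, grad_out):
    """Materializes a [b, n, k] intermediate, then sums over the batch."""
    per_batch = [mm(transpose(Ai), gi) for Ai, gi in zip(A, grad_out)]
    n, k = len(per_batch[0]), len(per_batch[0][0])
    return [[sum(p[i][j] for p in per_batch) for j in range(k)]
            for i in range(n)]

def grad_x_folded(A, grad_out):
    """Folds the batch: [b*m, n].T @ [b*m, k] -> [n, k], no large intermediate."""
    A_flat = [row for Ai in A for row in Ai]
    g_flat = [row for gi in grad_out for row in gi]
    return mm(transpose(A_flat), g_flat)

A = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]  # [b=2, m=2, n=2]
g = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 0.0]]]  # [b=2, m=2, k=2]
assert grad_x_naive(A, g) == grad_x_folded(A, g) == [[6.0, 8.0], [8.0, 10.0]]
```

Both paths compute the same [n, k] result, but the folded version never allocates the [b, n, k] intermediate, which is what makes the difference for large batch sizes.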
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki.

Merge failed. Reason: this PR has internal changes and must be landed via Phabricator.
@pytorchbot merge -f "unrelated errors"

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki.

Merge failed. Reason: this PR has internal changes and must be landed via Phabricator.
@pytorchbot merge -f "unrelated errors"

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
This pull request has been reverted by 8c8148c. To re-land this change, please open another pull request, assign the same reviewers, fix the CI failures that caused the revert, and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).
Merged in #97355
Stack from ghstack (oldest at bottom):
With this PR, matmul folds a bmm into a mm or a mv if and only if it
can do so without copying. We add tests to make sure that our algorithm
for detecting this is accurate.
For the cases where it was copying before, see #75197 (comment), #75197 (comment), and #75197 (comment).
Fixes #76702
cc @ngimel @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @lezcano