Invalidate StorageImpl instances when tensor is overwritten with cudagraphs #125264
isuruf wants to merge 27 commits into `gh/isuruf/47/base`
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125264
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit e42683f with merge base 76dca1f.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
c10/core/StorageImpl.h (outdated)

```cpp
  }
  TORCH_CHECK(
      false, "Cannot access data pointer of Storage that is invalid.");
}
```
Don't put this implementation in the header; move the error-testing code to the .cpp so it doesn't get inlined. You're going to blow up code size a lot if you don't.
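The suggested split might look something like the following sketch. This is illustrative only, not the actual PyTorch code: `StorageImplSketch` and `throw_invalid_storage_error` are made-up names, and the two "files" are shown together so the snippet is self-contained. The point is that the hot path keeps only a flag test plus a call, while the cold error-raising code is compiled once out of line.

```cpp
#include <cstddef>
#include <stdexcept>

// StorageImpl.h (sketch): only a declaration here, so the error path
// is never inlined into call sites.
[[noreturn]] void throw_invalid_storage_error();

struct StorageImplSketch {
  void* data_ptr_ = nullptr;
  bool invalid_ = false;

  // Hot path stays inline: a branch and, on the cold path, a single call.
  void* data() const {
    if (invalid_) {
      throw_invalid_storage_error();
    }
    return data_ptr_;
  }
};

// StorageImpl.cpp (sketch): the error-raising code lives here exactly once.
[[noreturn]] void throw_invalid_storage_error() {
  throw std::runtime_error(
      "Cannot access data pointer of Storage that is invalid.");
}
```

Because the out-of-line function is `[[noreturn]]`, the compiler can treat the error branch as cold, so each call site pays only a few instructions instead of the full string-formatting and throw machinery.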
Will need benchmarking.
Referenced by a test-infra change that used this PR for testing:

> As the next part of #5149, this PR provides some additional info about unstable and infra flaky jobs:
>
> * [x] Provide a link to the issue that marks the job as unstable
> * [x] Give the label(s) that suppress the job
> * [x] Print the rule from [flaky-rules.json](https://github.com/pytorch/test-infra/blob/generated-stats/stats/flaky-rules.json) that marks the job as flaky
> * [x] Explain the reason for infra flakiness (I couldn't find any recent examples for manual testing)
>
> ### Testing
>
> 1. From pytorch/pytorch#125264
>    * **NEW FAILURE**: [pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 2, 5, linux.g5.4xlarge.nvidia.gpu)](https://hud.pytorch.org/pr/pytorch/pytorch/125264#24452065236) ([gh](https://github.com/pytorch/pytorch/actions/runs/8901751864/job/24452065236)) `inductor/test_cudagraph_trees.py::CudaGraphTreeTests::test_aliasing_static_ref`
>    * **FLAKY** (likely due to flakiness present on trunk; matched the **linux** rule in flaky-rules.json): [Lint / lintrunner-noclang / linux-job](https://hud.pytorch.org/pr/pytorch/pytorch/125264#24446808455) ([gh](https://github.com/pytorch/pytorch/actions/runs/8901751878/job/24446808455)) `The process '/usr/bin/git' failed with exit code 1`
> 2. From pytorch/executorch#3318
>    * **UNSTABLE** (likely flaky on trunk and marked as unstable): [Android / test-llama-app / mobile-job (android)](https://hud.pytorch.org/pr/pytorch/executorch/3318#24282434625) ([gh](https://github.com/pytorch/executorch/actions/runs/8842776071/job/24282434625)) ([#3344](pytorch/executorch#3344)) `Credentials could not be loaded, please check your action inputs: Could not load credentials from any providers`
> 3. From pytorch/pytorch#125143
>    * [Lint / lintrunner-noclang / linux-job](https://hud.pytorch.org/pr/pytorch/pytorch/125143#24373801771) ([gh](https://github.com/pytorch/pytorch/actions/runs/8878104746/job/24373801771)) `>>> Lint for torch/onnx/_internal/onnx_proto_utils.py:`
>    * **FLAKY** (suppressed by suppress-bc-linter): [BC Lint / bc_linter](https://hud.pytorch.org/pr/pytorch/pytorch/125143#24450453134) ([gh](https://github.com/pytorch/pytorch/actions/runs/8878104658/job/24450453134)) `Process completed with exit code 1.`
>    * **FLAKY** ([similar failure](https://hud.pytorch.org/pytorch/pytorch/commit/1a0b24776212b383d025010e935f33f58a96e276#24348608242)): [pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 2, 5, linux.4xlarge.nvidia.gpu)](https://hud.pytorch.org/pr/pytorch/pytorch/125143#24374207841) ([gh](https://github.com/pytorch/pytorch/actions/runs/8878104713/job/24374207841)) `test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_div_cuda_bool`
eellison left a comment:

Looks good! Just needs the .h -> .cpp changes ezyang commented about.
The test failure is real. However, it seems to be highlighting an existing bug/feature? The following fails on current main:

```python
import torch

class Mod(torch.nn.Linear):
    def forward(self, x):
        return self.weight.T @ x, self.weight.T, self.weight[0:4]

m = Mod(3, 3).cuda()

@torch.compile(mode="reduce-overhead")
def foo(mod, x):
    return mod(x)

@torch.compile(mode="reduce-overhead")
def foo2(x):
    return x[2:]

x = torch.rand([3, 3], device="cuda", requires_grad=True)
out1, alias_1, alias_2 = foo(m, x)
out2 = foo2(out1)
out2_clone = out2.clone()
out2.sum().backward()
foo(m, x)
assert torch.allclose(out2_clone, out2)
```

This is not supposed to fail, right?
You can add … That should fail, because the second `foo(m, x)` call triggers a new invocation of the graph and overwrites the existing outputs.
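One way to picture the behavior being discussed is the following minimal model in plain C++. This is a hypothetical sketch, not PyTorch's actual implementation: `StorageSketch` and `GraphRunnerSketch` are invented names. A cudagraph writes its outputs into the same static buffer on every replay, so a second call overwrites the memory that the first call's outputs still alias; this PR's fix marks the previous outputs' storages invalid so stale reads raise the "Storage that is invalid" error instead of silently observing the new data.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <vector>

// A storage that can be invalidated; reading an invalid storage throws,
// mirroring the error message added in this PR.
struct StorageSketch {
  const float* data = nullptr;
  bool invalid = false;

  float read(std::size_t i) const {
    if (invalid) {
      throw std::runtime_error(
          "Cannot access data pointer of Storage that is invalid.");
    }
    return data[i];
  }
};

// Models a captured graph: every replay rewrites the same static buffer.
struct GraphRunnerSketch {
  std::vector<float> static_buffer;                  // reused on every replay
  std::vector<std::shared_ptr<StorageSketch>> live;  // last replay's outputs

  explicit GraphRunnerSketch(std::size_t n) : static_buffer(n) {}

  std::shared_ptr<StorageSketch> replay(float value) {
    for (auto& s : live) {
      s->invalid = true;  // the fix: invalidate outputs of earlier replays
    }
    live.clear();
    for (auto& x : static_buffer) {
      x = value;  // the graph overwrites its static output buffer
    }
    auto out = std::make_shared<StorageSketch>(
        StorageSketch{static_buffer.data(), false});
    live.push_back(out);
    return out;
  }
};
```

In this model, a second `replay` both overwrites the buffer and flips the first output's storage to invalid, so code still holding the first output fails loudly (as in the repro's final `assert`) rather than comparing against silently overwritten data.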
ezyang left a comment:

Besides benchmarking, this seems basically OK.
@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
…graphs (pytorch#125264) Fixes pytorch#104435 Pull Request resolved: pytorch#125264 Approved by: https://github.com/ezyang
…ith cudagraphs (pytorch#125264)" This reverts commit 1bc390c. Reverted pytorch#125264 on behalf of https://github.com/jithunnair-amd because test/inductor/test_cudagraph_trees.py::CudaGraphTreeTests::test_fallback_to_eager_if_recompiling_too_many_times is failing (https://github.com/pytorch/pytorch/actions/runs/9933628108/job/27477785946, https://hud.pytorch.org/pytorch/pytorch/commit/1bc390c5f5ac065c156f55f4eceed267ecc67b41). The test was introduced by pytorch@fa5f572, which is before the merge base ([comment](pytorch#125264 (comment)))
…graphs (pytorch#125264) Fixes pytorch#104435 Pull Request resolved: pytorch#125264 Approved by: https://github.com/ezyang
…ith cudagraphs (pytorch#125264)" This reverts commit 8390843. Reverted pytorch#125264 on behalf of https://github.com/izaitsevfb because it breaks internal tests ([comment](pytorch#125264 (comment)))
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as Stale.
…n with cudagraphs" Fixes #104435 cc mcarilli ezyang eellison penguinwu anijain2305 chauhang voznesenskym EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire msaroufim bdhirsh [ghstack-poisoned]
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
Fixes #104435
cc @mcarilli @ezyang @eellison @penguinwu @anijain2305 @chauhang @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @msaroufim @bdhirsh