Invalidate StorageImpl instances when tensor is overwritten with cudagraphs #125264
isuruf wants to merge 27 commits into `gh/isuruf/47/base`
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125264
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit e42683f with merge base 76dca1f.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
c10/core/StorageImpl.h (outdated)

```cpp
  }
  TORCH_CHECK(
      false, "Cannot access data pointer of Storage that is invalid.");
}
```
Don't put this implementation in the header; move the error-testing code to the .cpp so it doesn't get inlined. You're going to blow up code size a lot if you don't.
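The suggested split might look something like the following sketch. This is illustrative only, not the actual PyTorch code: `StorageImplSketch` and `throw_invalid_storage_error` are made-up names, and the two "files" are shown together so the snippet is self-contained. The point is that the hot path keeps only a flag test plus a call, while the cold error-raising code is compiled once out of line.

```cpp
#include <cstddef>
#include <stdexcept>

// StorageImpl.h (sketch): only a declaration here, so the error path
// is never inlined into call sites.
[[noreturn]] void throw_invalid_storage_error();

struct StorageImplSketch {
  void* data_ptr_ = nullptr;
  bool invalid_ = false;

  // Hot path stays inline: a branch and, on the cold path, a single call.
  void* data() const {
    if (invalid_) {
      throw_invalid_storage_error();
    }
    return data_ptr_;
  }
};

// StorageImpl.cpp (sketch): the error-raising code lives here exactly once.
[[noreturn]] void throw_invalid_storage_error() {
  throw std::runtime_error(
      "Cannot access data pointer of Storage that is invalid.");
}
```

Because the out-of-line function is `[[noreturn]]`, the compiler can treat the error branch as cold, so each call site pays only a few instructions instead of the full string-formatting and throw machinery.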
Will need benchmarking.
Referenced by a test-infra change that used this PR for testing:

> As the next part of #5149, this PR provides some additional info about unstable and infra flaky jobs:
>
> * [x] Provide a link to the issue that marks the job as unstable
> * [x] Give the label(s) that suppress the job
> * [x] Print the rule from [flaky-rules.json](https://github.com/pytorch/test-infra/blob/generated-stats/stats/flaky-rules.json) that marks the job as flaky
> * [x] Explain the reason for infra flakiness (I couldn't find any recent examples for manual testing)
>
> ### Testing
>
> 1. From pytorch/pytorch#125264
>    * **NEW FAILURE**: [pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 2, 5, linux.g5.4xlarge.nvidia.gpu)](https://hud.pytorch.org/pr/pytorch/pytorch/125264#24452065236) ([gh](https://github.com/pytorch/pytorch/actions/runs/8901751864/job/24452065236)) `inductor/test_cudagraph_trees.py::CudaGraphTreeTests::test_aliasing_static_ref`
>    * **FLAKY** (likely due to flakiness present on trunk; matched the **linux** rule in flaky-rules.json): [Lint / lintrunner-noclang / linux-job](https://hud.pytorch.org/pr/pytorch/pytorch/125264#24446808455) ([gh](https://github.com/pytorch/pytorch/actions/runs/8901751878/job/24446808455)) `The process '/usr/bin/git' failed with exit code 1`
> 2. From pytorch/executorch#3318
>    * **UNSTABLE** (likely flaky on trunk and marked as unstable): [Android / test-llama-app / mobile-job (android)](https://hud.pytorch.org/pr/pytorch/executorch/3318#24282434625) ([gh](https://github.com/pytorch/executorch/actions/runs/8842776071/job/24282434625)) ([#3344](pytorch/executorch#3344)) `Credentials could not be loaded, please check your action inputs: Could not load credentials from any providers`
> 3. From pytorch/pytorch#125143
>    * [Lint / lintrunner-noclang / linux-job](https://hud.pytorch.org/pr/pytorch/pytorch/125143#24373801771) ([gh](https://github.com/pytorch/pytorch/actions/runs/8878104746/job/24373801771)) `>>> Lint for torch/onnx/_internal/onnx_proto_utils.py:`
>    * **FLAKY** (suppressed by suppress-bc-linter): [BC Lint / bc_linter](https://hud.pytorch.org/pr/pytorch/pytorch/125143#24450453134) ([gh](https://github.com/pytorch/pytorch/actions/runs/8878104658/job/24450453134)) `Process completed with exit code 1.`
>    * **FLAKY** ([similar failure](https://hud.pytorch.org/pytorch/pytorch/commit/1a0b24776212b383d025010e935f33f58a96e276#24348608242)): [pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 2, 5, linux.4xlarge.nvidia.gpu)](https://hud.pytorch.org/pr/pytorch/pytorch/125143#24374207841) ([gh](https://github.com/pytorch/pytorch/actions/runs/8878104713/job/24374207841)) `test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_div_cuda_bool`
eellison left a comment:

Looks good! Just needs the .h -> .cpp changes ezyang commented about.
The test failure is real. However, it seems to be highlighting an existing bug/feature? The following fails on current main:

```python
import torch

class Mod(torch.nn.Linear):
    def forward(self, x):
        return self.weight.T @ x, self.weight.T, self.weight[0:4]

m = Mod(3, 3).cuda()

@torch.compile(mode="reduce-overhead")
def foo(mod, x):
    return mod(x)

@torch.compile(mode="reduce-overhead")
def foo2(x):
    return x[2:]

x = torch.rand([3, 3], device="cuda", requires_grad=True)
out1, alias_1, alias_2 = foo(m, x)
out2 = foo2(out1)
out2_clone = out2.clone()
out2.sum().backward()
foo(m, x)
assert torch.allclose(out2_clone, out2)
```

This is not supposed to fail, right?
You can add … That should fail, because the second `foo(m, x)` call triggers a new invocation of the graph and overwrites the existing outputs.
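One way to picture the behavior being discussed is the following minimal model in plain C++. This is a hypothetical sketch, not PyTorch's actual implementation: `StorageSketch` and `GraphRunnerSketch` are invented names. A cudagraph writes its outputs into the same static buffer on every replay, so a second call overwrites the memory that the first call's outputs still alias; this PR's fix marks the previous outputs' storages invalid so stale reads raise the "Storage that is invalid" error instead of silently observing the new data.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <vector>

// A storage that can be invalidated; reading an invalid storage throws,
// mirroring the error message added in this PR.
struct StorageSketch {
  const float* data = nullptr;
  bool invalid = false;

  float read(std::size_t i) const {
    if (invalid) {
      throw std::runtime_error(
          "Cannot access data pointer of Storage that is invalid.");
    }
    return data[i];
  }
};

// Models a captured graph: every replay rewrites the same static buffer.
struct GraphRunnerSketch {
  std::vector<float> static_buffer;                  // reused on every replay
  std::vector<std::shared_ptr<StorageSketch>> live;  // last replay's outputs

  explicit GraphRunnerSketch(std::size_t n) : static_buffer(n) {}

  std::shared_ptr<StorageSketch> replay(float value) {
    for (auto& s : live) {
      s->invalid = true;  // the fix: invalidate outputs of earlier replays
    }
    live.clear();
    for (auto& x : static_buffer) {
      x = value;  // the graph overwrites its static output buffer
    }
    auto out = std::make_shared<StorageSketch>(
        StorageSketch{static_buffer.data(), false});
    live.push_back(out);
    return out;
  }
};
```

In this model, a second `replay` both overwrites the buffer and flips the first output's storage to invalid, so code still holding the first output fails loudly (as in the repro's final `assert`) rather than comparing against silently overwritten data.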
ezyang left a comment:

Besides benchmarking, this seems basically OK.
@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
…graphs (pytorch#125264) Fixes pytorch#104435 Pull Request resolved: pytorch#125264 Approved by: https://github.com/ezyang
…ith cudagraphs (pytorch#125264)" This reverts commit 1bc390c. Reverted pytorch#125264 on behalf of https://github.com/jithunnair-amd because test/inductor/test_cudagraph_trees.py::CudaGraphTreeTests::test_fallback_to_eager_if_recompiling_too_many_times is failing (https://github.com/pytorch/pytorch/actions/runs/9933628108/job/27477785946, https://hud.pytorch.org/pytorch/pytorch/commit/1bc390c5f5ac065c156f55f4eceed267ecc67b41). The test was introduced by pytorch@fa5f572, which is before the merge base ([comment](pytorch#125264 (comment)))
…graphs (pytorch#125264) Fixes pytorch#104435 Pull Request resolved: pytorch#125264 Approved by: https://github.com/ezyang
…ith cudagraphs (pytorch#125264)" This reverts commit 8390843. Reverted pytorch#125264 on behalf of https://github.com/izaitsevfb because it breaks internal tests ([comment](pytorch#125264 (comment)))
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as Stale.
…n with cudagraphs" Fixes #104435 cc mcarilli ezyang eellison penguinwu anijain2305 chauhang voznesenskym EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire msaroufim bdhirsh [ghstack-poisoned]
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
Fixes #104435
cc @mcarilli @ezyang @eellison @penguinwu @anijain2305 @chauhang @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @msaroufim @bdhirsh