Test that FSDP2 works with CUDA graphs (#171835)
Conversation
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/171835
✅ You can merge normally (1 unrelated failure). As of commit c88f2b1 with merge base 68370db, the following job failed but was likely due to flakiness present on trunk (FLAKY).
This comment was automatically generated by Dr. CI and updates every 15 minutes.
I initially wrote in pytorch#164264 that there was a missing `wait_stream()` call needed to put a stream into stream-capture mode, but surprisingly, since I filed that issue the problem has been fixed. After a brief search, I was not able to locate the exact commit that coincidentally made that fix. Since `CachingHostAllocator` has supported memory allocation during stream capture since pytorch#167507, the purpose of this PR is simply to make sure that the support does not regress.

An important detail is that we need to make sure that the CUDA graph still overlaps the all-gather and reduce-scatter streams with the computation stream. To check for that, I applied this patch:

```diff
diff --git a/test/distributed/_composable/fsdp/test_fully_shard_training.py b/test/distributed/_composable/fsdp/test_fully_shard_training.py
index c0831d8..c0fecdf787d 100644
--- a/test/distributed/_composable/fsdp/test_fully_shard_training.py
+++ b/test/distributed/_composable/fsdp/test_fully_shard_training.py
@@ -1681,8 +1681,8 @@ class TestFullyShardCudaGraph(FSDPTest):
         device = torch.device(device_type.type, self.rank)
         torch.manual_seed(42)
         model = nn.Sequential(
-            nn.Linear(8, 8, bias=False),
-            nn.Linear(8, 8, bias=False),
+            nn.Linear(4096, 4096, bias=False),
+            nn.Linear(4096, 4096, bias=False),
         ).to(device)
         for param in model.parameters():
             dist.broadcast(param, src=0)
@@ -1694,7 +1694,7 @@ class TestFullyShardCudaGraph(FSDPTest):
         # warmup
         with torch.cuda.stream(stream):
-            input_tensor = torch.randn(4, 8, device=device)
+            input_tensor = torch.randn(4, 4096, device=device)
             output = model(input_tensor)
             output.sum().backward()
             model.zero_grad(set_to_none=True)
@@ -1711,7 +1711,7 @@ class TestFullyShardCudaGraph(FSDPTest):
         ]

         # equivalence check
-        with torch.cuda.stream(stream):
+        with torch.cuda.stream(stream), torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA], record_shapes=True, profile_memory=True) as prof:
             for _ in range(2):
                 replay_input = torch.randn(4, 8, device=device)
                 ref_output = model(replay_input)
@@ -1726,6 +1726,8 @@ class TestFullyShardCudaGraph(FSDPTest):
             for graph_grad, ref_grad in zip(static_output_grads, ref_grads):
                 self.assertTrue(torch.equal(graph_grad, ref_grad))
             model.zero_grad(set_to_none=True)
+            prof.step()
+            prof.export_chrome_trace(f"two_layer_fully_shard_cudagraph_{self.rank}.json")

 if __name__ == "__main__":
```

I then inspected the JSON file manually to check for overlap.

Closes issue pytorch#164264
I accidentally skipped the test: `python test/distributed/_composable/fsdp/test_fully_shard_training.py TestFullyShardCudaGraph.test_two_layer_fully_shard_cudagraph` ignores the `unittest.skipIf` decorator! Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
@pytorchbot merge |
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Review comment on these lines:

```
static_input.copy_(replay_input)
graph.replay()
```
lol, if you attempted to do this for real you would be debugging for a long time why gradients are not accumulating, but for tests this is good enough
For what it's worth, the idiom of doing `model.zero_grad(set_to_none=True)` before stream capture comes from the original PyTorch blog post on CUDA graphs, so it wouldn't surprise me if most code out in the wild does this.
Very unfortunate for anyone who might be using CUDA graphs this way while trying to do gradient accumulation over multiple minibatches 😬
Merge failed. Reason: 1 job has failed, first few of them are: linux-aarch64 / linux-jammy-aarch64-py3.10 / test (openreg, 1, 1, lf.linux.arm64.m8g.4xlarge). Details for Dev Infra team: raised by workflow job
@pytorchbot merge -i |
Merge started. Your change will be merged while ignoring the following 1 check: linux-aarch64 / linux-jammy-aarch64-py3.10 / test (openreg, 1, 1, lf.linux.arm64.m8g.4xlarge). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
(Merged commit message duplicates the PR description above.)
Fixes pytorch#164264
Pull Request resolved: pytorch#171835
Approved by: https://github.com/ezyang, https://github.com/ngimel, https://github.com/BoyuanFeng, https://github.com/eellison
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Edward Yang <ezyang@meta.com>
@galv Thanks so much for the change! This means a lot for keeping FSDP2 relevant in the era of Grace CPU.
cc @mcarilli @ezyang @eellison @penguinwu @BoyuanFeng