
Test that FSDP2 works with cuda graphs.#171835

Closed
galv wants to merge 3 commits intopytorch:mainfrom
galv:fsdp2-cuda-graph

Conversation

@galv
Collaborator

@galv galv commented Jan 6, 2026

I initially wrote in #164264 that a wait_stream() call was missing to put a stream into stream-capture mode, but surprisingly the problem has been fixed since I filed that issue. After a brief search I was not able to locate the exact commit that coincidentally made the fix. Since #167507, CachingHostAllocator supports memory allocation during stream capture, so the purpose of this PR is simply to make sure that support does not regress.

An important detail is that we need to make sure the CUDA graph still overlaps the all-gather and reduce-scatter streams with the computation stream. To check for that, I applied this patch:

```diff
diff --git a/test/distributed/_composable/fsdp/test_fully_shard_training.py b/test/distributed/_composable/fsdp/test_fully_shard_training.py
index c0831d87d7c..c0fecdf787d 100644
--- a/test/distributed/_composable/fsdp/test_fully_shard_training.py
+++ b/test/distributed/_composable/fsdp/test_fully_shard_training.py
@@ -1681,8 +1681,8 @@ class TestFullyShardCudaGraph(FSDPTest):
         device = torch.device(device_type.type, self.rank)
         torch.manual_seed(42)
         model = nn.Sequential(
-            nn.Linear(8, 8, bias=False),
-            nn.Linear(8, 8, bias=False),
+            nn.Linear(4096, 4096, bias=False),
+            nn.Linear(4096, 4096, bias=False),
         ).to(device)
         for param in model.parameters():
             dist.broadcast(param, src=0)
@@ -1694,7 +1694,7 @@ class TestFullyShardCudaGraph(FSDPTest):

         # warmup
         with torch.cuda.stream(stream):
-            input_tensor = torch.randn(4, 8, device=device)
+            input_tensor = torch.randn(4, 4096, device=device)
             output = model(input_tensor)
             output.sum().backward()
             model.zero_grad(set_to_none=True)
@@ -1711,7 +1711,7 @@ class TestFullyShardCudaGraph(FSDPTest):
             ]

         # equivalence check
-        with torch.cuda.stream(stream):
+        with torch.cuda.stream(stream), torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA], record_shapes=True, profile_memory=True) as prof:
             for _ in range(2):
                 replay_input = torch.randn(4, 8, device=device)
                 ref_output = model(replay_input)
@@ -1726,6 +1726,8 @@ class TestFullyShardCudaGraph(FSDPTest):
                 for graph_grad, ref_grad in zip(static_output_grads, ref_grads):
                     self.assertTrue(torch.equal(graph_grad, ref_grad))
                 model.zero_grad(set_to_none=True)
+                prof.step()
+        prof.export_chrome_trace(f"two_layer_fully_shard_cudagraph_{self.rank}.json")

 if __name__ == "__main__":
```

I then inspected the JSON file manually to check for overlap.
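The overlap check above was done by eye in the trace viewer, but it can also be sketched mechanically: a chrome trace is JSON whose `traceEvents` carry per-stream timestamps, so overlap is just interval intersection across two streams. The field layout below (`tid` as stream id, `ts`/`dur` in microseconds, `cat == "kernel"`) is an illustrative simplification, not the exact schema torch.profiler emits:

```python
import json
import os
import tempfile

def streams_overlap(trace_path, stream_a, stream_b):
    """Return True if any kernel on stream_a overlaps in time with
    any kernel on stream_b in a chrome-trace-style JSON file."""
    with open(trace_path) as f:
        events = json.load(f)["traceEvents"]

    def kernels(stream):
        # (start, end) intervals for kernel events on one stream
        return [(e["ts"], e["ts"] + e["dur"])
                for e in events
                if e.get("cat") == "kernel" and e.get("tid") == stream]

    for s0, e0 in kernels(stream_a):
        for s1, e1 in kernels(stream_b):
            if max(s0, s1) < min(e0, e1):  # intervals intersect
                return True
    return False

# Tiny synthetic trace: an all-gather on stream 20 runs while a
# matmul executes on stream 7 (timestamps in microseconds).
trace = {"traceEvents": [
    {"cat": "kernel", "name": "ncclAllGather", "tid": 20, "ts": 100, "dur": 50},
    {"cat": "kernel", "name": "gemm",          "tid": 7,  "ts": 120, "dur": 40},
]}
path = os.path.join(tempfile.mkdtemp(), "trace.json")
with open(path, "w") as f:
    json.dump(trace, f)

print(streams_overlap(path, 20, 7))   # the two kernels overlap in [120, 150)
```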

Fixes #164264

cc @mcarilli @ezyang @eellison @penguinwu @BoyuanFeng

@pytorch-bot

pytorch-bot bot commented Jan 6, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/171835

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit c88f2b1 with merge base 68370db:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jan 6, 2026
@linux-foundation-easycla

linux-foundation-easycla bot commented Jan 6, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: ezyang / name: Edward Z. Yang (c88f2b1)
  • ✅ login: galv / name: Daniel Galvez (0452df8, 8724462)
  • ✅ login: Skylion007 / name: Aaron Gokaslan (8724462)

@galv galv added the module: cuda graphs Ability to capture and then replay streams of CUDA kernels label Jan 6, 2026
Accidentally skipped the test. `python test/distributed/_composable/fsdp/test_fully_shard_training.py TestFullyShardCudaGraph.test_two_layer_fully_shard_cudagraph` ignores the unittest.skipIf decorator!

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Signed-off-by: Edward Yang <ezyang@meta.com>
@ezyang
Contributor

ezyang commented Jan 7, 2026

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 7, 2026
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

```python
            ]

        static_input.copy_(replay_input)
        graph.replay()
```
Collaborator

lol if you attempted to do this for real you would be debugging for a long time why gradients are not accumulated, but for tests this is good enough

Collaborator Author

For what it's worth, the idiom of calling model.zero_grad(set_to_none=True) before stream capture comes from the original PyTorch blog post on CUDA Graphs, so it wouldn't surprise me if most code out in the wild does this.

Very unfortunate for anyone who might be using cuda graphs this way and trying to do gradient accumulation over multiple minibatches 😬
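The pitfall can be modeled without a GPU: whichever AccumulateGrad path is live at capture time (assignment when .grad is None, addition when a grad buffer already exists) is what gets baked into the graph and repeated on every replay. A torch-free sketch, with a hypothetical CapturedBackward class standing in for the captured graph:

```python
class CapturedBackward:
    """Hypothetical stand-in for a captured backward pass. It only
    models one distinction: whether the grad buffer existed when the
    graph was captured."""

    def __init__(self, grad_was_none_at_capture):
        # zero_grad(set_to_none=True) before capture -> grad is None,
        # so the captured op is an overwrite, not an accumulation.
        self.overwrite = grad_was_none_at_capture
        self.static_grad = 0.0  # stands in for the static grad buffer

    def replay(self, new_grad):
        if self.overwrite:
            self.static_grad = new_grad    # captured as "grad = g"
        else:
            self.static_grad += new_grad   # captured as "grad += g"

# Grad was None at capture: two replays do NOT accumulate.
g = CapturedBackward(grad_was_none_at_capture=True)
g.replay(1.0)
g.replay(1.0)
print(g.static_grad)  # 1.0

# Grad buffer materialized (e.g. zeros) before capture: replays accumulate.
g = CapturedBackward(grad_was_none_at_capture=False)
g.replay(1.0)
g.replay(1.0)
print(g.static_grad)  # 2.0
```

So anyone doing gradient accumulation over multiple replays would silently keep only the last minibatch's gradients.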

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: linux-aarch64 / linux-jammy-aarch64-py3.10 / test (openreg, 1, 1, lf.linux.arm64.m8g.4xlarge)

Details for Dev Infra team Raised by workflow job

@msaroufim
Member

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 1 checks: linux-aarch64 / linux-jammy-aarch64-py3.10 / test (openreg, 1, 1, lf.linux.arm64.m8g.4xlarge)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

krastogi-in pushed a commit to krastogi-in/pytorch that referenced this pull request Jan 9, 2026
Pull Request resolved: pytorch#171835
Approved by: https://github.com/ezyang, https://github.com/ngimel, https://github.com/BoyuanFeng, https://github.com/eellison
hinriksnaer pushed a commit to hinriksnaer/pytorch that referenced this pull request Jan 12, 2026
@weifengpy
Contributor

@galv Thanks so much for the change! This means a lot for keeping FSDP2 relevant in the era of Grace CPU.

@weifengpy weifengpy added the release notes: distributed (fsdp2) release notes category label Jan 26, 2026

Labels

  • ciflow/trunk — Trigger trunk jobs on your pull request
  • Merged
  • module: cuda graphs — Ability to capture and then replay streams of CUDA kernels
  • open source
  • release notes: distributed (fsdp2) — release notes category
  • topic: not user facing — topic category

Development

Successfully merging this pull request may close these issues.

Cuda graph support for FSDP2 is lacking.

10 participants