[dynamo] Add FakeProcessGroup support for fx_graph_runnable with distributed collectives #157162
skarjala wants to merge 14 commits into gh/skarjala/10/base
Conversation
See artifacts and rendered test results at hud.pytorch.org/pr/157162. Note: links to docs will display an error until the docs builds have been completed.
As of commit 544ce40 with merge base 178fe7a: 5 cancelled jobs. The following jobs were cancelled; please retry.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
from torch.distributed._tensor import DeviceMesh, DTensor, Replicate, Shard
from torch.testing._internal.common_utils import IS_FBCODE, IS_SANDCASTLE
from torch.testing._internal.distributed.fake_pg import FakeStore
There will probably be some failing tests; the distributed imports other than torch.distributed itself must be gated behind an availability check:

if torch.distributed.is_available():
    from torch.distributed._tensor import DeviceMesh, DTensor, Replicate, Shard
    from torch.testing._internal.distributed.fake_pg import FakeStore
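The gating pattern above can be sketched without torch installed; the snippet below emulates `torch.distributed.is_available()` with an `importlib` availability check on a stand-in stdlib module (the names `HAS_DIST`, `requires_dist`, and the module chosen are illustrative, not the PR's actual code):

```python
import importlib.util
import unittest

# Stand-in for torch.distributed.is_available(): probe whether the
# optional module can be imported before importing anything from it.
HAS_DIST = importlib.util.find_spec("multiprocessing") is not None

if HAS_DIST:
    import multiprocessing as dist_like  # stand-in for the gated imports
else:
    dist_like = None  # symbols stay defined so module import never fails


def requires_dist(test_func):
    """Skip a test when the optional dependency is unavailable."""
    return unittest.skipUnless(HAS_DIST, "requires distributed support")(test_func)
```

The key point is that the module still imports cleanly either way; only the tests that touch the gated symbols get skipped.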
torch/_dynamo/repro/after_aot.py
Outdated
fd.write(
    "import torch.distributed as dist\n"
    "from torch.testing._internal.distributed.fake_pg import FakeStore\n"
)
Could you include a generated fx_graph_runnable in the PR summary? Let's make sure the imports still stay at the top of the file, grouped with the others.
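One way to keep the imports grouped at the top is to decide the whole header up front rather than appending import lines mid-file. A minimal sketch, assuming a hypothetical helper (`emit_header` and its parameter name are not the PR's actual code):

```python
def emit_header(has_distributed_ops: bool) -> str:
    # Hypothetical codegen helper: collect every import first, then emit
    # them in one block at the top of the generated fx_graph_runnable file.
    imports = ["import torch"]
    if has_distributed_ops:
        imports += [
            "import torch.distributed as dist",
            "from torch.testing._internal.distributed.fake_pg import FakeStore",
        ]
    return "\n".join(imports) + "\n\n"
```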
torch/_dynamo/repro/after_aot.py
Outdated
# Add distributed cleanup after run_repro
if has_distributed_ops:
    fd.write("dist.destroy_process_group()\n")
Also, I'd like to double-check against the generated fx_graph_runnable file: there's no indentation for this line?
torch/_dynamo/repro/after_aot.py
Outdated
# Add distributed cleanup after run_repro if needed
if has_distributed_ops:
    fd.write("dist.destroy_process_group()\n")
    fd.write(" \n dist.destroy_process_group()\n")
nit: take a look at the codegen'd fx graph runnable:

with torch.no_grad():
    run_repro(mod, load_args, accuracy=False, command='run', save_dir=None, tracing_mode='real', check_str=None)
dist.destroy_process_group()
    # To run it separately, do
    # mod, args = run_repro(mod, load_args, accuracy=False, command='get_args', save_dir=None, tracing_mode='real', check_str=None)
    # mod(*args)

See how the comment below the run_repro is indented? The original intent is likely for people to uncomment those lines if they need them. But now, with your added destroy_process_group, uncommenting those lines would error: they need to run under the no_grad context. I'd recommend moving the destroy_process_group call after those comment lines.
from torch.distributed._tensor import DeviceMesh, DTensor, Replicate, Shard
from torch.testing._internal.distributed.fake_pg import FakeStore
else:
    # Define dummy classes if distributed is not available
Tests using these classes should be skipped when distributed is not available
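The usual way to skip such tests is a class- or method-level `unittest` skip decorator keyed on the availability check. A self-contained sketch (`HAS_DISTRIBUTED` is a stand-in for `torch.distributed.is_available()`; the test class and names are illustrative):

```python
import io
import unittest

# Stand-in for torch.distributed.is_available(); forced False here to
# demonstrate the skip path.
HAS_DISTRIBUTED = False


@unittest.skipUnless(HAS_DISTRIBUTED, "requires torch.distributed")
class TestFakePGRepro(unittest.TestCase):
    def test_uses_device_mesh(self):
        # Would reference DeviceMesh / DTensor / FakeStore here.
        self.assertTrue(True)


# Running the suite records the test as skipped rather than errored.
result = unittest.TextTestRunner(stream=io.StringIO()).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(TestFakePGRepro)
)
```

This avoids needing dummy classes at all for the tests themselves; the dummies are only there so the module imports cleanly.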
torch/_dynamo/repro/after_aot.py
Outdated
| ) | ||
|
|
||
| fd.write( |
|
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 5 checks: pull / linux-jammy-py3.9-clang12 / build, pull / linux-jammy-cuda12.8-py3.10-gcc11-build-distributed / build, pull / linux-jammy-py3-clang12-executorch / build, pull / before-test / target-determination, inductor-rocm / rocm-py3.10-inductor / test (inductor, 1, 2, linux.rocm.gpu.2). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…le (#157594) Pull Request resolved: #157594 Approved by: https://github.com/xmfan ghstack dependencies: #157162
Stack from ghstack (oldest at bottom):
Summary:
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames