Test tracing consecutive comms on the same input tensor #84980

Closed

mrshenli wants to merge 3 commits into gh/mrshenli/336/base from gh/mrshenli/336/head

Conversation


@mrshenli mrshenli commented Sep 14, 2022


pytorch-bot bot commented Sep 14, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/84980

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures, 2 Pending

As of commit bb9024a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing label Sep 14, 2022
@facebook-github-bot facebook-github-bot added the cla signed and oncall: distributed labels Sep 14, 2022
@mrshenli (Contributor, Author)

We can see that the 2nd allreduce properly waits for the completion of the 1st allreduce, and the final result properly waits for the 2nd allreduce.

```
opcode         name                  target                                          args                                                     kwargs
-------------  --------------------  ----------------------------------------------  -------------------------------------------------------  --------
placeholder    x_1                   x_1                                             ()                                                       {}
call_function  add                   aten.add.Tensor                                 (x_1, x_1)                                               {}
get_attr       _tensor_constant0     _tensor_constant0                               ()                                                       {}
get_attr       _tensor_constant1     _tensor_constant1                               ()                                                       {}
call_function  allreduce__default    c10d.allreduce_.default                         ([add], _tensor_constant0, _tensor_constant1, -1)        {}
call_function  comm_result           <function _wrap_comm_result at 0x7f2c12ee90d0>  (allreduce__default,)                                    {}
call_function  getitem               <built-in function getitem>                     (comm_result, 0)                                         {}
call_function  getitem_1             <built-in function getitem>                     (getitem, 0)                                             {}
call_function  wait_comm             <function _wait_comm at 0x7f2c1b25f310>         (getitem_1,)                                             {}
get_attr       _tensor_constant2     _tensor_constant2                               ()                                                       {}
get_attr       _tensor_constant3     _tensor_constant3                               ()                                                       {}
call_function  allreduce__default_1  c10d.allreduce_.default                         ([wait_comm], _tensor_constant2, _tensor_constant3, -1)  {}
call_function  comm_result_1         <function _wrap_comm_result at 0x7f2c12ee90d0>  (allreduce__default_1,)                                  {}
call_function  getitem_3             <built-in function getitem>                     (comm_result_1, 0)                                       {}
call_function  getitem_4             <built-in function getitem>                     (getitem_3, 0)                                           {}
call_function  wait_comm_1           <function _wait_comm at 0x7f2c1b25f310>         (getitem_4,)                                             {}
call_function  mul                   aten.mul.Tensor                                 (wait_comm_1, 2)                                         {}
output         output                output                                          (mul,)                                                   {}
```
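For context, an eager-mode function with the same shape as the graph above (add, two consecutive in-place allreduces on the same tensor, then multiply) might look like the following. This is a sketch, not the PR's actual test code; the single-rank gloo process group is only there so the snippet can run on CPU anywhere.

```python
import os
import torch
import torch.distributed as dist

# Single-rank gloo group so this sketch runs on CPU (illustrative setup only).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

def comm_fn(x):
    y = x + x
    work1 = dist.all_reduce(y, async_op=True)  # 1st in-place allreduce
    work1.wait()                               # 2nd allreduce must see its result
    work2 = dist.all_reduce(y, async_op=True)  # 2nd allreduce on the same tensor
    work2.wait()                               # final result waits on the 2nd
    return y * 2

out = comm_fn(torch.ones(2))
# With world_size == 1 each allreduce is an identity, so out == (1 + 1) * 2.
print(out)
dist.destroy_process_group()
```

With more than one rank, each `all_reduce` would sum the tensor across ranks; the wait-before-reuse ordering is exactly the dependency chain that the traced graph above preserves.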

@mrshenli mrshenli requested a review from wanchaol September 14, 2022 03:39
@mrshenli mrshenli changed the title Support tracing consecutive comms on the same input tensor Test tracing consecutive comms on the same input tensor Sep 14, 2022
mrshenli added a commit that referenced this pull request Sep 14, 2022
```python
def _test_consecutive_comm_work_wait(self, tensor):
    def comm_fn(tensor, group=None):
        work1 = dist.all_reduce(tensor, group=group, async_op=True)
        work1.wait()
```
Collaborator

Shall we also have a test case where this work1.wait() is called after work2 is produced?

@mrshenli (Contributor, Author)

I can add that, but I am not sure we should. Waiting after work2 is produced is only correct because the ProcessGroup internally uses the same stream for communication. If that subtle assumption no longer holds, we could run into a race condition, e.g., the 2nd allreduce starts reading the memory before the 1st finishes writing to it.

Curious: is there a use case where we need to call work1.wait() after work2 is produced?
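To make the ordering concern concrete, here is a toy model of async work handles in pure Python (no torch, no real communication; the `Work` and `all_reduce` names are made up for illustration). It records event order to show the safe pattern: work1.wait() completes before the 2nd collective is launched, so the 2nd allreduce never reads the buffer while the 1st is still writing it.

```python
# Toy model of async comm handles; purely illustrative, no real communication.
log = []

class Work:
    """Stand-in for a c10d Work handle that just records events."""
    def __init__(self, name):
        self.name = name
        log.append(f"{name} launched")

    def wait(self):
        log.append(f"{self.name} waited")

def all_reduce(tensor, async_op=True):
    # Number handles by how many collectives were launched so far.
    idx = sum(1 for e in log if e.endswith("launched")) + 1
    return Work(f"allreduce#{idx}")

buf = [1.0]
w1 = all_reduce(buf)
w1.wait()              # complete the 1st before launching the 2nd
w2 = all_reduce(buf)
w2.wait()

print(log)
# ['allreduce#1 launched', 'allreduce#1 waited',
#  'allreduce#2 launched', 'allreduce#2 waited']
```

Deferring `w1.wait()` until after `w2` is launched would interleave the two collectives on the shared buffer, which is only safe if some other mechanism (such as a single shared communication stream) serializes them anyway.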

@mrshenli (Contributor, Author)

@pytorchbot merge -g

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered with the green (-g) flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

pytorchmergebot pushed a commit that referenced this pull request Feb 15, 2024
Seeing this error for c10d tests when running on 1 GPU. Adding a skip when there are insufficient GPUs.

```
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
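A common way to guard such a test is a conditional skip on the available GPU count. The sketch below uses plain `unittest.skipIf` for illustration; PyTorch's own test suite has dedicated skip helpers for this purpose, and the class and test names here are hypothetical.

```python
import unittest

try:
    import torch
    gpu_count = torch.cuda.device_count()
except ImportError:  # fallback so the sketch is self-contained without torch
    gpu_count = 0

class TestC10d(unittest.TestCase):
    # Skip cleanly instead of hitting "CUDA error: invalid device ordinal"
    # when the test asks for a second device that does not exist.
    @unittest.skipIf(gpu_count < 2, "requires at least 2 GPUs")
    def test_needs_two_gpus(self):
        pass  # multi-GPU test body would go here
```

Run on a 1-GPU (or CPU-only) machine, the test reports as skipped rather than erroring out.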
referring to #84980

Pull Request resolved: #119938
Approved by: https://github.com/eqy, https://github.com/fegin

Labels

cla signed · Merged · oncall: distributed · topic: not user facing

4 participants