Test tracing consecutive comms on the same input tensor#84980
mrshenli wants to merge 3 commits into gh/mrshenli/336/base
Conversation
[ghstack-poisoned]
We can see that the 2nd allreduce properly waited for the completion of the 1st allreduce, and the final result properly waited for the 2nd allreduce.
```python
def _test_consecutive_comm_work_wait(self, tensor):
    def comm_fn(tensor, group=None):
        work1 = dist.all_reduce(tensor, group=group, async_op=True)
        work1.wait()
```
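A self-contained sketch of the pattern under test. This uses a single-process gloo group purely so the sketch is runnable on CPU; the actual test exercises NCCL across multiple GPUs, and the tensor shape and values here are illustrative assumptions, not taken from the test:

```python
import os
import torch
import torch.distributed as dist

# Single-process gloo group so the sketch runs anywhere;
# the real test uses NCCL on multiple GPUs.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

tensor = torch.ones(4)

# First async all_reduce, waited on before the second is issued.
work1 = dist.all_reduce(tensor, async_op=True)
work1.wait()

# Second all_reduce reuses the same input tensor.
work2 = dist.all_reduce(tensor, async_op=True)
work2.wait()

# With world_size=1 the sum reduction leaves the tensor unchanged.
assert torch.equal(tensor, torch.ones(4))

dist.destroy_process_group()
```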
Shall we also have a test case where this work1.wait() is called after work2 is produced?
I can add that, but I am not sure we should. Waiting after work2 is produced is only correct because the ProcessGroup internally uses the same stream for communication. If that subtle assumption no longer holds, we could run into a race condition, e.g., the 2nd allreduce starts reading the memory before the 1st finishes writing it.
Curious: is there a use case where we need to call work1.wait() after work2 is produced?
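For reference, the deferred-wait ordering discussed above can be sketched as follows. This is a hypothetical ordering, not something added by this PR; it again uses a single-process gloo group so it is runnable, and, as the comment thread notes, it is only safe because the process group happens to serialize collectives internally:

```python
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29502")
dist.init_process_group("gloo", rank=0, world_size=1)

t = torch.full((4,), 2.0)

# Both collectives are issued before either wait. This relies on the
# ProcessGroup serializing them on the same internal stream/thread —
# an implementation detail, which is exactly the concern raised above.
work1 = dist.all_reduce(t, async_op=True)
work2 = dist.all_reduce(t, async_op=True)
work2.wait()
work1.wait()

# With world_size=1 both sum reductions leave the tensor unchanged.
assert torch.equal(t, torch.full((4,), 2.0))
dist.destroy_process_group()
```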
@pytorchbot merge -g
@pytorchbot successfully started a merge job. Check the current status here.
Pull Request resolved: #84980
Approved by: https://github.com/wanchaol
Seeing this error for c10d tests when running on 1 GPU; adding a skip when there are insufficient GPUs.

```
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

Referring to #84980.
Pull Request resolved: #119938
Approved by: https://github.com/eqy, https://github.com/fegin
Stack from ghstack (oldest at bottom):