Record redistribute_local_tensor in DebugMode#163704
SherlockNoMad wants to merge 5 commits into main
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163704
✅ You can merge normally! (1 Unrelated Failure)
As of commit 8b1ee97 with merge base 8d81564: BROKEN TRUNK - one job failed but was also failing on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
zpcore left a comment:
Nice, I was going to implement capturing the transforminfo, but you already did it! Is it possible to have a flag in debug_mode to control this?
Overall LGTM, may need to update the test file.
By the way, I hope in DebugMode we can get the list of
I tried; it looks a bit too verbose. Any way to simplify?
    for mode in _get_current_dispatch_mode_stack():
        if isinstance(mode, DebugMode):
            debug_mode = mode
            break
done.
get_active_debug_mode
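For readers following the thread, here is a minimal sketch of how the loop quoted above could be folded into a helper. Only the name `get_active_debug_mode` comes from the reply; the body, the `_MODE_STACK` list, the `push_mode` context manager, and the placeholder classes are assumptions made so the example runs without PyTorch, and the merged code may differ.

```python
from contextlib import contextmanager
from typing import Iterator, List, Optional


class DispatchMode:
    """Hypothetical placeholder for a dispatch mode (not PyTorch's class)."""


class DebugMode(DispatchMode):
    """Hypothetical placeholder for the DebugMode discussed in this PR."""


# Stand-in for the stack returned by _get_current_dispatch_mode_stack().
_MODE_STACK: List[DispatchMode] = []


@contextmanager
def push_mode(mode: DispatchMode) -> Iterator[DispatchMode]:
    """Push a mode for the duration of the `with` block, then pop it."""
    _MODE_STACK.append(mode)
    try:
        yield mode
    finally:
        _MODE_STACK.pop()


def get_active_debug_mode() -> Optional[DebugMode]:
    """Return the first DebugMode found on the stack, or None if absent."""
    for mode in _MODE_STACK:
        if isinstance(mode, DebugMode):
            return mode
    return None
```

A call site can then replace the four-line loop with `debug_mode = get_active_debug_mode()` and check the result against `None`.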
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
Merge failed. Reason: 1 mandatory check(s) failed.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
Merge failed. Reason: 1 job has failed: trunk / linux-jammy-cuda12.8-py3.10-gcc11 / test (distributed, 3, 3, linux.g4dn.12xlarge.nvidia.gpu)
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
That's good enough. I will follow up to update the str representation of the
An explicit redistribute_local_tensor API call can also result in communication, so record it!
Pull Request resolved: #163704
Approved by: https://github.com/ezyang
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci
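To illustrate the idea in the description, the sketch below logs an explicit redistribute call and nests the resulting communication under it. It is not PyTorch's implementation: `SketchDebugMode`, `record_redistribute_call`, `toy_redistribute_local_tensor`, and the `record_redistribute` flag (in the spirit of the reviewer's question about a controlling flag) are all hypothetical names, and placements are plain strings.

```python
from contextlib import contextmanager
from typing import Any, Iterator, List, Optional


class SketchDebugMode:
    """Hypothetical stand-in for DebugMode; not PyTorch's implementation."""

    def __init__(self, record_redistribute: bool = True) -> None:
        # Assumed flag that switches redistribute recording on or off.
        self.record_redistribute = record_redistribute
        self.log: List[str] = []
        self.depth = 0  # current indentation level in the log

    @contextmanager
    def record_redistribute_call(self, src: str, dst: str) -> Iterator[None]:
        """Log the redistribute call and indent anything recorded inside it."""
        if not self.record_redistribute:
            yield
            return
        self.log.append(
            "  " * self.depth + f"redistribute_local_tensor({src} -> {dst})"
        )
        self.depth += 1
        try:
            yield
        finally:
            self.depth -= 1

    def record_op(self, name: str) -> None:
        """Log a single operation at the current indentation level."""
        self.log.append("  " * self.depth + name)


def toy_redistribute_local_tensor(
    data: Any, src: str, dst: str, debug_mode: Optional[SketchDebugMode] = None
) -> Any:
    """Toy redistribute: records the call and a fake collective, returns data unchanged."""
    if debug_mode is None:
        return data
    with debug_mode.record_redistribute_call(src, dst):
        debug_mode.record_op("all_gather")  # placeholder for real communication
    return data
```

With recording enabled, the log shows the redistribute entry followed by the indented collective; with `record_redistribute=False`, only the collective is logged.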