Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100413
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 59a0b52: This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 3 jobs have failed, first few of them are: trunk / macos-12-py3-arm64 / test (default, 1, 3, macos-m1-12), trunk / macos-12-py3-arm64 / test (default, 2, 3, macos-m1-12), trunk / macos-12-py3-arm64 / test (default, 3, 3, macos-m1-12) Details for Dev Infra teamRaised by workflow job |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again. You can rebase and merge by leaving the following comment on this PR: Details for Dev Infra teamRaised by workflow job |
|
@pytorchbot merge -r |
|
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
|
Successfully rebased |
abdef1f to
59a0b52
Compare
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Previous timeout log does not print size info. Making it hard to debug hang caused by message size mismatch. (Reason is that when copying `WorkNCCL` object during work enqueue, we don't copy `outputs_` due to reference concern, hence `output.size()` is never triggered.) This PR logs sizes using separate fields, hence not relying on `outputs_`. New timeout log: ``` [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=_ALLGATHER_BASE, NumelIn=209715200, NumelOut=1677721600, Timeout(ms)=10000) ran for 10957 milliseconds before timing out. ``` Pull Request resolved: pytorch#100413 Approved by: https://github.com/kumpera
Previous timeout log does not print size info. Making it hard to debug hang caused by message size mismatch.
(Reason is that when copying
WorkNCCLobject during work enqueue, we don't copyoutputs_due to reference concern, henceoutput.size()is never triggered.)This PR logs sizes using separate fields, hence not relying on
outputs_.New timeout log: