[C10D] Report detected failures when emitting collective end events.#109739
[C10D] Report detected failures when emitting collective end events.#109739kumpera wants to merge 16 commits intogh/kumpera/64/basefrom
Conversation
We leverage the fact that errors get tracked in Work to report them. The error message by itself is not crazy useful and we might need to include some form of error code in the mix (nccl failure, timeout, etc). [ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/109739
Note: Links to docs will display an error until the docs builds have been completed. ❌ 2 New Failures, 1 Unrelated FailureAs of commit 152d8c0 with merge base c151163 ( NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following job failed but were present on the merge base:👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
…nd events." We leverage the fact that errors get tracked in Work to report them. The error message by itself is not crazy useful and we might need to include some form of error code in the mix (nccl failure, timeout, etc). [ghstack-poisoned]
…nd events." We leverage the fact that errors get tracked in Work to report them. The error message by itself is not crazy useful and we might need to include some form of error code in the mix (nccl failure, timeout, etc). [ghstack-poisoned]
…nd events." We leverage the fact that errors get tracked in Work to report them. The error message by itself is not crazy useful and we might need to include some form of error code in the mix (nccl failure, timeout, etc). [ghstack-poisoned]
…nd events." We leverage the fact that errors get tracked in Work to report them. The error message by itself is not crazy useful and we might need to include some form of error code in the mix (nccl failure, timeout, etc). [ghstack-poisoned]
…nd events." We leverage the fact that errors get tracked in Work to report them. The error message by itself is not crazy useful and we might need to include some form of error code in the mix (nccl failure, timeout, etc). [ghstack-poisoned]
…nd events." We leverage the fact that errors get tracked in Work to report them. The error message by itself is not crazy useful and we might need to include some form of error code in the mix (nccl failure, timeout, etc). [ghstack-poisoned]
…nd events." We leverage the fact that errors get tracked in Work to report them. The error message by itself is not crazy useful and we might need to include some form of error code in the mix (nccl failure, timeout, etc). [ghstack-poisoned]
4a43294 to
c4601d7
Compare
…nd events." We leverage the fact that errors get tracked in Work to report them. The error message by itself is not crazy useful and we might need to include some form of error code in the mix (nccl failure, timeout, etc). [ghstack-poisoned]
|
There's one major issue left on this PR: it doesn't work with gloo. |
…nd events." We leverage the fact that errors get tracked in Work to report them. The error message by itself is not crazy useful and we might need to include some form of error code in the mix (nccl failure, timeout, etc). [ghstack-poisoned]
…nd events." We leverage the fact that errors get tracked in Work to report them. The error message by itself is not crazy useful and we might need to include some form of error code in the mix (nccl failure, timeout, etc). [ghstack-poisoned]
fduwjj
left a comment
There was a problem hiding this comment.
How we ensure WatchDog does not abort the program and let us return error message?
…nd events." We leverage the fact that errors get tracked in Work to report them. The error message by itself is not crazy useful and we might need to include some form of error code in the mix (nccl failure, timeout, etc). [ghstack-poisoned]
…nd events." We leverage the fact that errors get tracked in Work to report them. The error message by itself is not crazy useful and we might need to include some form of error code in the mix (nccl failure, timeout, etc). [ghstack-poisoned]
…nd events." We leverage the fact that errors get tracked in Work to report them. The error message by itself is not crazy useful and we might need to include some form of error code in the mix (nccl failure, timeout, etc). [ghstack-poisoned]
…nd events." We leverage the fact that errors get tracked in Work to report them. The error message by itself is not crazy useful and we might need to include some form of error code in the mix (nccl failure, timeout, etc). [ghstack-poisoned]
| evt.operation = c10d::opTypeToString(work.retrieveOpType()); | ||
| evt.drop_count = 0; | ||
| // isCompleted is mutable :facepalm: | ||
| if (const_cast<Work&>(work).isCompleted() && !work.isSuccess()) |
There was a problem hiding this comment.
@fduwjj I think we should discuss this part
we're calling work.exception() which will query some cuda APIs.
we're doing this from the main watchdog thread.
i think that in today's watchdog design this is already something we're doing in the main loop, so i don't have a big reservation about it.
but if you're redesigning the loops to put cuda accesses on another thread, you'd also want to move this part. would be good to think through how we'll handle the callback reporting error message safely, or else maybe not add the error to the callback in the first place. (I think it should be possible to do though)
…nd events." We leverage the fact that errors get tracked in Work to report them. The error message by itself is not crazy useful and we might need to include some form of error code in the mix (nccl failure, timeout, etc). [ghstack-poisoned]
|
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as |
Stack from ghstack (oldest at bottom):
We leverage the fact that errors get tracked in Work to report them.
The error message by itself is not crazy useful and we might need to
include some form of error code in the mix (nccl failure, timeout, etc).