Attach traceback to Exception & Test disatpching process#1036
Attach traceback to Exception & Test disatpching process#1036ejguan wants to merge 4 commits intometa-pytorch:mainfrom
Conversation
|
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
NivekT
left a comment
There was a problem hiding this comment.
Overall, this looks good to me! I ran the code locally to confirm that it works and the message looks right. It worked.
One thing is that it took awhile to raise the exception (and eventually shutdown). Any reason why?
| "The operation of `round_robin_demux` with `num_instances=1` is an no-op and returns the provided `datapipe` in a list directly" | ||
| ) | ||
| return datapipe | ||
| return [datapipe] |
There was a problem hiding this comment.
Question: In what cases will users intentionally want to set num_instances=1 here?
There was a problem hiding this comment.
This is the case when num_workers=1 with dispatching process.
|
|
||
|
|
||
| def CreateThreadForDataPipeline(datapipe): | ||
| def CreateThreadForDataPipeline(datapipe, name): |
There was a problem hiding this comment.
Forgot to ask before: if this function is used anywhere currently?
There was a problem hiding this comment.
I don't think there is any usage for sure. We can remove it in the future since we might rely on asyncio to achieve threading.
| self.assertTrue("Caught _CustomException in dispatching process" in exc_msg) | ||
| self.assertTrue("Original Traceback" in exc_msg) | ||
| self.assertTrue("_CustomException: oops" in exc_msg) | ||
|
|
There was a problem hiding this comment.
nit: Maybe check if DL2 is shutdown properly? Is that possible to test?
There was a problem hiding this comment.
Actually, DL2 is not properly shutdown for now because we only raise Error in the main process. We haven't handled to shutdown other worker processes when on process has Error. This probably the same reason that it takes longer time to shutdown.
This is tracked in #969
|
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
…h#1036) Summary: Partially fixes meta-pytorch#969 ### Changes - Add `ExceptionWrapper` to attach traceback to the Exception - Reason: traceback is unserializable. So, it has to be passed by string - In order to provide informative Error message, pass name for each process like `dispatching process` and `worker process <id>`. - Add tests to validate Error propagation from the dispatching process - parametrize the tests - Fix a bug for `round_robin_demux` to return a list of DataPipe rather than a single DataPipe when `num_of_instances` is 1. Pull Request resolved: meta-pytorch#1036 Reviewed By: NivekT Differential Revision: D43472709 Pulled By: ejguan fbshipit-source-id: e5c9e581ca881f523fb568b6f46bf16ecfc243d2
Summary: Partially fixes #969 ### Changes - Add `ExceptionWrapper` to attach traceback to the Exception - Reason: traceback is unserializable. So, it has to be passed by string - In order to provide informative Error message, pass name for each process like `dispatching process` and `worker process <id>`. - Add tests to validate Error propagation from the dispatching process - parametrize the tests - Fix a bug for `round_robin_demux` to return a list of DataPipe rather than a single DataPipe when `num_of_instances` is 1. Pull Request resolved: #1036 Reviewed By: NivekT Differential Revision: D43472709 Pulled By: ejguan fbshipit-source-id: e5c9e581ca881f523fb568b6f46bf16ecfc243d2
Partially fixes #969
Changes
ExceptionWrapperto attach traceback to the Exceptiondispatching processandworker process <id>.round_robin_demuxto return a list of DataPipe rather than a single DataPipe whennum_of_instancesis 1.