🐛 Bug
Rehash of #37860, but creating to discuss whether we should implement this in ddp's reducer.
DDP makes use of the autograd engine's post hooks and callbacks in order to synchronize gradients across different workers. However, if the user runs their DDP model on a non-default CUDA stream, they can get incorrect and non-deterministic results: the callback that DDP installs is not guaranteed to run on the CUDA stream specified by the user, and since operations on different CUDA streams are not automatically synchronized, the callback can race with the user's backward kernels. Note that this does not apply to the post hooks.
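As a minimal illustration of the underlying hazard (this is not the #37790 repro; `doubled_sum` is a made-up example): a kernel queued on a side stream is not automatically ordered before work on the current stream, and `wait_stream()` is what inserts the missing dependency.

```python
# Sketch: work queued on two CUDA streams is not ordered automatically;
# torch.cuda.Stream.wait_stream() inserts the missing dependency.
import torch

def doubled_sum(x):
    side = torch.cuda.Stream()            # a non-default stream
    with torch.cuda.stream(side):
        y = x * 2                         # kernel queued on `side`
    # Without this wait, the current stream may read `y` before `side`
    # has finished writing it -- the same hazard DDP's callback hits.
    torch.cuda.current_stream().wait_stream(side)
    return y.sum()

if torch.cuda.is_available():
    out = doubled_sum(torch.ones(4, device="cuda"))
```

Omitting the `wait_stream` call gives exactly the kind of non-deterministic result described above, since whether the read observes the finished write depends on kernel timing.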
To Reproduce
See #37790 for the repro script.
Expected behavior
Correct behavior regardless of stream used by the user.
Additional context
We can either fix this by using a stream guard set to the correct stream in c10d/reducer.cpp, or provide this functionality by default in the autograd engine (tracked here: #37860). It would probably be easier to implement in the former, since the reducer has a better idea of which CUDA stream is being used. We can close this issue out if we decide to support this in the autograd engine instead.
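The reducer-side fix would amount to something like the following Python analog, where `torch.cuda.stream()` plays the role of the C++ stream guard (`c10::cuda::CUDAStreamGuard`) and `fake_allreduce` is a hypothetical stand-in for the real c10d collective:

```python
# Hypothetical sketch of the stream-guard approach: queue the reduction on the
# same stream the user ran backward on, so it is ordered after their kernels.
import torch

def fake_allreduce(grad):
    # Placeholder for the real c10d allreduce; just mutates in place.
    grad.div_(2)

def reduce_under_user_stream(grad, user_stream):
    # Guard: everything queued inside runs on `user_stream`, so the
    # reduction cannot race with the backward kernels on that stream.
    with torch.cuda.stream(user_stream):
        fake_allreduce(grad)

if torch.cuda.is_available():
    s = torch.cuda.Stream()
    g = torch.ones(4, device="cuda")
    reduce_under_user_stream(g, s)
    s.synchronize()
```

This only works from the reducer because, as noted above, the reducer can observe which stream the user's backward ran on; the generic autograd engine would need extra plumbing to recover that information.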
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar