🐛 Bug
Rehash of #37860, but creating to discuss whether we should implement this in ddp's reducer.
DDP makes use of the autograd engine's post hooks and callbacks in order to synchronize gradients across different workers. However, if the user runs their DDP model on a non-default CUDA stream, they can get incorrect and non-deterministic results: the callback that DDP installs is not guaranteed to run on the CUDA stream specified by the user, and since operations on different CUDA streams are not automatically synchronized, the callback can race with the user's backward kernels. Note that this does not apply to the post hooks.
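As a minimal illustration of the underlying hazard (this is not the #37790 repro; `doubled_sum` is a made-up example): a kernel queued on a side stream is not automatically ordered before work on the current stream, and `wait_stream()` is what inserts the missing dependency.

```python
# Sketch: work queued on two CUDA streams is not ordered automatically;
# torch.cuda.Stream.wait_stream() inserts the missing dependency.
import torch

def doubled_sum(x):
    side = torch.cuda.Stream()            # a non-default stream
    with torch.cuda.stream(side):
        y = x * 2                         # kernel queued on `side`
    # Without this wait, the current stream may read `y` before `side`
    # has finished writing it -- the same hazard DDP's callback hits.
    torch.cuda.current_stream().wait_stream(side)
    return y.sum()

if torch.cuda.is_available():
    out = doubled_sum(torch.ones(4, device="cuda"))
```

Omitting the `wait_stream` call gives exactly the kind of non-deterministic result described above, since whether the read observes the finished write depends on kernel timing.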
To Reproduce
See #37790 for the repro script.
Expected behavior
Correct behavior regardless of stream used by the user.
Additional context
We can either fix this by using a stream guard set to the correct stream in c10d/reducer.cpp, or provide this functionality by default in the autograd engine (tracked here: #37860). It would probably be easier to implement in the former, since the reducer has a better idea of which CUDA stream is being used. We can close this issue out if we decide to support this in the autograd engine instead.
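The reducer-side fix would amount to something like the following Python analog, where `torch.cuda.stream()` plays the role of the C++ stream guard (`c10::cuda::CUDAStreamGuard`) and `fake_allreduce` is a hypothetical stand-in for the real c10d collective:

```python
# Hypothetical sketch of the stream-guard approach: queue the reduction on the
# same stream the user ran backward on, so it is ordered after their kernels.
import torch

def fake_allreduce(grad):
    # Placeholder for the real c10d allreduce; just mutates in place.
    grad.div_(2)

def reduce_under_user_stream(grad, user_stream):
    # Guard: everything queued inside runs on `user_stream`, so the
    # reduction cannot race with the backward kernels on that stream.
    with torch.cuda.stream(user_stream):
        fake_allreduce(grad)

if torch.cuda.is_available():
    s = torch.cuda.Stream()
    g = torch.ones(4, device="cuda")
    reduce_under_user_stream(g, s)
    s.synchronize()
```

This only works from the reducer because, as noted above, the reducer can observe which stream the user's backward ran on; the generic autograd engine would need extra plumbing to recover that information.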
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar