DDP's autograd callbacks should respect CUDA stream #37944

@rohan-varma

Description

🐛 Bug

A rehash of #37860, opened to discuss whether we should implement this in DDP's reducer.
DDP uses the autograd engine's post hooks and callbacks to synchronize gradients across workers. However, if the user runs their DDP model on a non-default CUDA stream, they can get incorrect and non-deterministic results, because the callback that DDP installs is not guaranteed to run on the CUDA stream the user specified. Since operations on different CUDA streams are not automatically synchronized with each other, the callback can read gradient buffers before the kernels producing them have finished. Note that this does not apply to the post hooks.

To Reproduce

See #37790 for the repro script.

Expected behavior

Correct behavior regardless of stream used by the user.

Additional context

We can fix this either by using a stream guard set to the user's stream in c10d/reducer.cpp, or by providing this functionality by default in the autograd engine (tracked in #37860). The former would probably be easier to implement, since the reducer has a better idea of which CUDA stream is being used. We can close this issue out if we decide to support this in the autograd engine, though.
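To make the ordering hazard concrete, here is a minimal pure-Python sketch (no GPU required) that models CUDA streams as FIFO work queues: work on one stream runs in order, but two streams are not ordered with respect to each other. All names here (`Stream`, `run_backward`, `grad_kernel`, `allreduce_callback`) are hypothetical and this is only a model of the behavior, not DDP's actual implementation; the `sync_with_producer` branch stands in for the stream-guard/event-synchronization idea above.

```python
import threading
import queue
import time

class Stream:
    """Toy model of a CUDA stream: work items enqueued on one stream run
    in FIFO order, but two different streams are NOT synchronized."""
    def __init__(self):
        self._q = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            self._q.get()()  # pop the next work item and run it

    def launch(self, fn):
        self._q.put(fn)

    def synchronize(self):
        done = threading.Event()
        self._q.put(done.set)
        done.wait()

def run_backward(sync_with_producer):
    """Model backward() running on a user stream, plus a gradient
    callback that happens to run on a different stream."""
    user_stream, callback_stream = Stream(), Stream()
    grad = {"value": None}          # the gradient buffer
    grad_ready = threading.Event()  # stands in for a CUDA event
    result = []

    def grad_kernel():              # slow kernel that writes the gradient
        time.sleep(0.01)
        grad["value"] = 42
        grad_ready.set()

    def allreduce_callback():       # DDP-style callback reads the gradient
        if sync_with_producer:
            grad_ready.wait()       # the fix: order against the producer
        result.append(grad["value"])

    user_stream.launch(grad_kernel)
    callback_stream.launch(allreduce_callback)
    callback_stream.synchronize()
    user_stream.synchronize()
    return result[0]

# Without synchronization the callback races the kernel and may read a
# stale value; with it, the callback always sees the finished gradient.
print(run_backward(sync_with_producer=True))   # always 42
print(run_backward(sync_with_producer=False))  # nondeterministic: None or 42
```

The reducer-side fix amounts to the `sync_with_producer=True` branch: before the reduction reads a gradient, it is ordered after the stream that produced it, rather than relying on the two streams happening to interleave correctly.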

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar

Metadata

Labels

oncall: distributed — Add this issue/PR to distributed oncall triage queue
triaged — This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
