🚀 Feature
The autograd profiler does not cover torch.distributed ops such as allreduce, allgather, etc. Adding support for these would be invaluable for debugging performance issues.
We should cover all of the collective and point-to-point operations listed here: https://pytorch.org/docs/stable/distributed.html.
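As a sketch of how this could look once supported, the snippet below runs an allreduce under torch.autograd.profiler.profile. The worker helper, the gloo backend, and the two-process setup are illustrative choices, not part of any proposed API; today the collective simply does not appear in the resulting table, while with this feature it would show up as a named event with timings.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.autograd import profiler

def worker(rank, world_size):
    # Minimal single-machine setup using the gloo backend.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    tensor = torch.ones(1000) * rank
    with profiler.profile() as prof:
        # Today this collective does not appear in the profile; with the
        # proposed support it would show up as a named event with timings.
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)

    if rank == 0:
        print(prof.key_averages().table(sort_by="cpu_time_total"))
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```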
A nice extension for the autograd profiler in the distributed setting would be an API like torch.distributed.combine_profiles, where a single rank (e.g., rank 0) pulls the autograd profiles from all other ranks and provides users with a single Chrome trace view that displays the trace across all nodes. A sketch of what this could look like follows.
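A minimal sketch of what combine_profiles could do under the hood, assuming dist.gather_object is available and that export_chrome_trace emits a flat JSON array of events (its historical format): each rank serializes its own trace, rank 0 gathers them, retags the events by rank, and writes one merged trace viewable in chrome://tracing. The combine_profiles name and the out_path parameter are hypothetical.

```python
import json
import os
import tempfile
import torch.distributed as dist

def combine_profiles(prof, rank, world_size, out_path="combined_trace.json"):
    # Each rank dumps its own chrome trace to a temp file and loads it back.
    fd, path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    prof.export_chrome_trace(path)
    with open(path) as f:
        events = json.load(f)  # assumes a flat JSON array of trace events
    os.remove(path)

    # Rank 0 gathers every rank's event list.
    gathered = [None] * world_size if rank == 0 else None
    dist.gather_object(events, gathered, dst=0)

    if rank == 0:
        merged = []
        for r, evs in enumerate(gathered):
            for ev in evs:
                ev["pid"] = f"rank {r}"  # group events per rank in the viewer
                merged.append(ev)
        with open(out_path, "w") as f:
            json.dump(merged, f)
```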
Motivation
The autograd profiler has been an invaluable tool for debugging performance issues in PyTorch, and extending it to torch.distributed would be beneficial to users.
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski