
Autograd profiler support for torch.distributed #43231

@pritamdamania87

Description


🚀 Feature

The autograd profiler does not cover torch.distributed ops such as allreduce and allgather. Adding support for them would be invaluable for debugging performance issues.

We should cover all of the collective and point-to-point operations listed here: https://pytorch.org/docs/stable/distributed.html. A minimal sketch of the desired usage follows below.
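To make the request concrete, here is a minimal sketch of the usage this feature would enable, assuming the profiler records one event per collective. The event names and launch setup (torchrun with the gloo backend) are assumptions for illustration, not a settled design:

```python
import torch
import torch.distributed as dist

def main():
    # Assumes launch via torchrun, which sets RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT.
    dist.init_process_group(backend="gloo")
    tensor = torch.ones(1024)

    # Desired behavior: collectives such as all_reduce and broadcast show up
    # in the profiler output alongside regular aten ops.
    with torch.autograd.profiler.profile() as prof:
        dist.all_reduce(tensor)
        dist.broadcast(tensor, src=0)

    # Hypothetical output: rows for the collectives, including time spent
    # blocked waiting on remote ranks.
    print(prof.key_averages().table(sort_by="cpu_time_total"))

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```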

A nice extension for the autograd profiler in the distributed setting would be an API like torch.distributed.combine_profiles, where a single rank (e.g., rank 0) pulls the autograd profiles from all other ranks and presents users with a single chrome trace view spanning all nodes.
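As a rough illustration of what combine_profiles could look like, here is a sketch built only from existing primitives (dist.gather_object and the profiler's export_chrome_trace); the helper's signature and the per-rank pid tagging are assumptions, not a proposed final API:

```python
import json
import tempfile
import torch.distributed as dist

def combine_profiles(prof, group=None):
    """Hypothetical helper: gather each rank's chrome trace onto rank 0 and
    merge them into a single trace, tagging events with their source rank."""
    # Export this rank's profile to a temporary chrome trace file and load it.
    with tempfile.NamedTemporaryFile(mode="r+", suffix=".json") as f:
        prof.export_chrome_trace(f.name)
        f.seek(0)
        trace = json.load(f)
    # The export may be a bare event list or a dict with "traceEvents".
    events = trace["traceEvents"] if isinstance(trace, dict) else trace

    rank = dist.get_rank(group)
    world_size = dist.get_world_size(group)
    gathered = [None] * world_size if rank == 0 else None
    dist.gather_object(events, gathered, dst=0, group=group)

    if rank != 0:
        return None
    merged = []
    for src_rank, rank_events in enumerate(gathered):
        for event in rank_events:
            # One process row per rank in the trace viewer.
            event["pid"] = f"rank {src_rank}"
            merged.append(event)
    return {"traceEvents": merged}
```

On rank 0 the merged dict could be written out with json.dump and opened in chrome://tracing; rendering each rank as a separate process row would give the single cross-node view described above.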

Motivation

The autograd profiler has been an invaluable tool for debugging performance issues in PyTorch, and extending it to torch.distributed would be similarly beneficial to users.

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski


Labels

oncall: distributed (add this issue/PR to the distributed oncall triage queue), triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
