Support torch.distributed.scatter collective#9365
Conversation
pgmoka
left a comment
Mostly questions, and a request for extra documentation.
pgmoka
left a comment
Follow-up seems good. Let me know if you have any questions on https://github.com/pytorch/xla/pull/9365/files#r2151351304.
Otherwise, LGTM.
One minor thing: I believe the failing tests are due to flakiness. Can you confirm?
The TPU test failure is a known flake and was cleared up by re-running. The Torchprime e2e test failure is probably real, but because PRs from forks don't exercise that test, it's very difficult to tell where the failure comes from.
#9315
XLA doesn't have a distributed Scatter op, but we can place dummy tensor lists on the non-source ranks and use reduce_scatter.
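A minimal sketch of the idea, not the PR's actual code: the hypothetical helper `scatter_via_reduce_scatter` emulates `torch.distributed.scatter` by having every non-source rank contribute zero tensors, so a SUM `reduce_scatter` delivers exactly the source rank's chunks. It assumes a process group is already initialized and that all chunks have the same shape as the output.

```python
import torch
import torch.distributed as dist

def scatter_via_reduce_scatter(output, scatter_list, src=0):
    """Emulate dist.scatter(output, scatter_list, src) using reduce_scatter.

    Only `src` supplies real data; every other rank contributes zeros,
    which vanish under the SUM reduction, so each rank ends up with the
    chunk that `src` intended for it.
    """
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    if rank == src:
        inputs = list(scatter_list)
    else:
        # Dummy tensor list on non-source ranks: all zeros, same shape
        # as the output chunk, so they don't affect the SUM.
        inputs = [torch.zeros_like(output) for _ in range(world_size)]
    dist.reduce_scatter(output, inputs, op=dist.ReduceOp.SUM)
```

For example, with 4 ranks, rank 0 could call it with `scatter_list = list(torch.arange(8.).chunk(4))` and every rank would receive its 2-element chunk in `output`. The zero-fill trick trades extra communication (every rank participates in the reduction) for not needing a native Scatter op in XLA.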