
Support dist.all_gather related collective ops #7860

Merged
zpcore merged 8 commits into master from piz/cop on Aug 27, 2024
Conversation

@zpcore
Member

@zpcore zpcore commented Aug 15, 2024

Add dynamo/nondynamo support for torch.distributed.all_reduce and torch.distributed.all_gather_into_tensor.

Motivation: We want to deprecate the collective ops in xla_model.py and be consistent with torch.distributed.

Issue: dist.all_reduce doesn't work with the dynamo openxla backend at this time.
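
For context, a minimal sketch of how these two collectives are typically invoked with the XLA backend (illustrative only, not code from this PR; the xla:// init flow, tensor shapes, and function names are assumptions):

import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.xla_backend  # registers the "xla" process group backend
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    dist.init_process_group("xla", init_method="xla://")
    device = xm.xla_device()
    world_size = xr.world_size()

    # Nondynamo (eager) path: all_reduce reduces the tensor in place.
    t = torch.ones(4, device=device) * (index + 1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    # all_gather_into_tensor: the output holds world_size * input elements.
    out = torch.zeros(world_size * 4, device=device)
    dist.all_gather_into_tensor(out, t)

    # Dynamo path: trace the gather through the openxla backend.
    # (all_reduce under dynamo/openxla is the open issue noted above.)
    @torch.compile(backend="openxla")
    def gather_fn(output, inp):
        dist.all_gather_into_tensor(output, inp)
        return output

    out2 = torch.zeros(world_size * 4, device=device)
    gather_fn(out2, t)

    xm.mark_step()


if __name__ == "__main__":
    xmp.spawn(_mp_fn)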

@zpcore
Member Author

zpcore commented Aug 16, 2024

Hi @JackCaoG, I commented out assert met.metric_data("ExecuteTime")[0] == 1 in test_traceable_collectives.py since the value changed to 3 for the all_gather ops due to upstream changes. Shall we enable the test first so we can run it?
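
For reference, a rough sketch of the metrics pattern that assertion uses (not the exact test code; the description of metric_data's return value is an assumption about the torch_xla debug metrics API):

import torch_xla.debug.metrics as met

# metric_data returns a tuple whose first element is the number of recorded
# samples; for "ExecuteTime" that corresponds to how many times an XLA
# executable actually ran.
execute_count = met.metric_data("ExecuteTime")[0]
assert execute_count == 1  # the observed value changed to 3 after upstream changes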

@zpcore zpcore added the tpuci label Aug 16, 2024
@zpcore zpcore marked this pull request as ready for review August 16, 2024 20:02
@zpcore zpcore requested a review from lsy323 August 16, 2024 20:02
@zpcore zpcore changed the title from "prototype of all_gather related distributed ops" to "Support dist.all_gather related distributed ops" Aug 16, 2024
@zpcore zpcore changed the title from "Support dist.all_gather related distributed ops" to "Support dist.all_gather related collective ops" Aug 16, 2024
@zpcore
Member Author

zpcore commented Aug 16, 2024

I have no idea how this works:

xm.all_reduce(reduce_type, tensors, groups=self._mesh, pin_layout=False)
return _ret_work(tensors)

xm.all_reduce should return a tensor instead of modifying its input argument.

I will probably give up supporting dist.all_reduce in this PR.

Update: it turns out that for output_tensor = dist.all_reduce(input_tensor...), both input_tensor and output_tensor get updated. We have to use input_tensor as the final result for the nondynamo path.
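
A short sketch of the in-place semantics described above (assumed torch.distributed behavior, not code from this PR; the helper name is made up):

import torch
import torch.distributed as dist


def nondynamo_all_reduce(input_tensor: torch.Tensor) -> torch.Tensor:
    # dist.all_reduce mutates input_tensor in place; it returns a Work handle
    # only when async_op=True, never a freshly reduced tensor.
    work = dist.all_reduce(input_tensor, op=dist.ReduceOp.SUM, async_op=True)
    work.wait()
    # Hand back the (now updated) input tensor as the final result.
    return input_tensor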

@lsy323 lsy323 removed their request for review August 20, 2024 16:26
@zpcore zpcore requested a review from will-cromar August 20, 2024 22:51
@zpcore zpcore requested a review from will-cromar August 21, 2024 23:12
@zpcore zpcore merged commit f9a706e into master Aug 27, 2024
@zpcore zpcore deleted the piz/cop branch August 27, 2024 16:58
@rpsilva-aws
Collaborator

Motivation: We want to deprecate the collective ops in xla_model.py and be consistent with torch.distributed.

@zpcore This makes sense, and we would like to help close that gap. Do you have a more detailed motivation and/or an open issue for migrating the remaining collective ops? I tried to find one, but I'm checking in first before creating a new one. cc: @miladm @tengyifei
