[Distributed] Switch all_reduce to use the new functional collective op #6887

Merged: alanwaketan merged 1 commit into pytorch:master from yifuwang:all_reduce on Apr 10, 2024
Conversation

@yifuwang (Contributor) commented Apr 4, 2024

PyTorch has implemented a new set of functional collective ops and is planning to remove the old ops. This PR migrates all_reduce to use the new op.

See context in pytorch/pytorch#93173 (comment)
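To make the in-place vs. functional distinction behind this migration concrete, here is a plain-Python sketch of the two calling conventions. The helpers below are illustrative stand-ins, not the real torch.distributed API:

```python
# Plain-Python sketch of the semantic difference between a legacy
# in-place all_reduce and the new *functional* collective op.
# The function names here are illustrative, not the real API.

def legacy_all_reduce_(per_rank):
    """Legacy style: every rank's buffer is overwritten in place."""
    total = sum(per_rank)
    for i in range(len(per_rank)):
        per_rank[i] = total  # mutates the caller's buffer

def functional_all_reduce(per_rank):
    """Functional style: inputs are left untouched and the reduced
    result is returned as a new value, which is easier for tracers
    and compilers (e.g. the machinery behind torch.compile and XLA)
    to reason about."""
    total = sum(per_rank)
    return [total] * len(per_rank)

buffers = [1, 2, 3]               # one entry per "rank"
out = functional_all_reduce(buffers)
print(buffers)                    # unchanged: [1, 2, 3]
print(out)                        # [6, 6, 6]

legacy_all_reduce_(buffers)
print(buffers)                    # mutated in place: [6, 6, 6]
```

The functional form returns a fresh value instead of mutating its input, which is the property the new collective ops are built around.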

@JackCaoG JackCaoG requested a review from alanwaketan April 4, 2024 17:51
@yifuwang yifuwang marked this pull request as draft April 5, 2024 20:09
@alanwaketan (Collaborator) left a comment


Let me know if you need any help setting up the XLA environment.

Comment thread torch_xla/core/xla_model.py Outdated
Comment thread torch_xla/csrc/cross_replica_reduces.cpp Outdated
@yifuwang (Contributor, Author) commented Apr 8, 2024

> Let me know if you need any help setting up the XLA environment.

Hey @alanwaketan, I'd appreciate your help on this. Is it possible to set up the dev env without having access to Docker?

@alanwaketan (Collaborator)

> Hey @alanwaketan, I'd appreciate your help on this. Is it possible to set up the dev env without having access to Docker?

Right...

@alanwaketan (Collaborator)

> Hey @alanwaketan, I'd appreciate your help on this. Is it possible to set up the dev env without having access to Docker?

Have you followed https://github.com/pytorch/xla/blob/master/CONTRIBUTING.md? What error do you get? I wonder if you can just develop and test this on a CPU machine, with no need to access a TPU.

@yifuwang (Contributor, Author) commented Apr 9, 2024

> Have you followed https://github.com/pytorch/xla/blob/master/CONTRIBUTING.md? What error do you get? I wonder if you can just develop and test this on a CPU machine, with no need to access a TPU.

Yeah, I tried to follow it, but building from source requires Docker, which is not available in my dev env :(

@alanwaketan (Collaborator)

> Yeah, I tried to follow it, but building from source requires Docker, which is not available in my dev env :(

Oh, so you cannot use Docker at all, or you cannot access our Docker images?

@yifuwang (Contributor, Author) commented Apr 9, 2024

> Oh, so you cannot use Docker at all, or you cannot access our Docker images?

I cannot use Docker; it's a restriction of my corporate dev env :(

@alanwaketan (Collaborator)

Maybe you can leave the development to our team, then. I don't know how difficult it is to use a non-Docker env... @JackCaoG, do you know?

@yifuwang (Contributor, Author) commented Apr 9, 2024

> Maybe you can leave the development to our team, then.

Sounds good. Thanks for the help, @alanwaketan! I just wanted to make sure there wasn't a gap blocking torch-xla from adopting the new API.

@yifuwang yifuwang marked this pull request as ready for review April 10, 2024 18:48
@yifuwang (Contributor, Author)

@alanwaketan - following your advice, the CI is now green.

Note that this PR doesn't address the TODO regarding generating groups. It merely switches to the new API while staying consistent with the old behavior. Happy to continue discussing how to plumb through group information.

Let me know what you think.
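The "consistent with the old behavior" point can be sketched as a thin compatibility wrapper: call the functional op, then copy its result back into the caller's buffer. This is a hypothetical plain-Python illustration of that pattern, not the actual torch-xla implementation:

```python
# Hypothetical sketch (not the real torch-xla code): keeping a
# legacy in-place all_reduce contract while the implementation
# underneath switches to a functional collective.

def functional_all_reduce(per_rank):
    """Stand-in for the new functional op: returns a new reduced value
    and leaves its input untouched."""
    total = sum(per_rank)
    return [total] * len(per_rank)

def all_reduce(per_rank):
    """Legacy-compatible wrapper: same observable in-place behavior
    as before, new functional op underneath."""
    reduced = functional_all_reduce(per_rank)
    per_rank[:] = reduced   # copy back to preserve in-place semantics
    return per_rank

buf = [4, 5, 6]
all_reduce(buf)
print(buf)                  # [15, 15, 15]
```

Callers that relied on the old mutate-in-place contract see no change, while the collective itself is now expressed functionally.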

@alanwaketan (Collaborator)

Thanks, Yifu. We generally don't need group information in XLA, so we can follow up on that later. Let me trigger the TPU CI; once everything is green, I will merge the PR.

@alanwaketan (Collaborator) left a comment


LGTM.

@alanwaketan (Collaborator)

Oh, I guess we cannot run the TPU CI on a fork... Never mind, I will just merge it. The head CI will take care of the rest.

@alanwaketan alanwaketan merged commit a816c42 into pytorch:master Apr 10, 2024
yifuwang pushed a commit to pytorch/pytorch that referenced this pull request Apr 10, 2024
… legacy funcol

After pytorch/xla#6887, torch-xla now also uses
the all_reduce from native funcol. So we can remove this logic.

[ghstack-poisoned]
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Apr 13, 2024
… legacy funcol (#123776)

After pytorch/xla#6887, torch-xla now also uses
the all_reduce from native funcol. So we can remove this logic.

Pull Request resolved: #123776
Approved by: https://github.com/wanchaol