Skip to content

Replacing explicit backend search with api call #144944

Closed
AnantGulati wants to merge 4 commits intopytorch:mainfrom
AnantGulati:AnantGulati_distributed_test_case_changes
Closed

Replacing explicit backend search with api call #144944
AnantGulati wants to merge 4 commits intopytorch:mainfrom
AnantGulati:AnantGulati_distributed_test_case_changes

Conversation

@AnantGulati
Copy link
Contributor

@AnantGulati AnantGulati commented Jan 16, 2025

Following up on PR #138216

In this PR we are modifying the internal structure of common distributed class DistributedTestBase to use an inbuilt function get_default_backend_for_device (#140536).

We have also refactored test file test_functional_api to remove all explicit calls to test_hpu. to add support for a new device to this file a user must only add the device to list of devices. This supports both in-tree and out of tree devices

This PR also adds a setter for world size inside multiprocess test case. This allows for test cases to redefine world size for different test cases in the same class and removes the need to add a new class simply to change properties like world size as shown in test_functional_api

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @kwen2501 @c-p-i-o

@pytorch-bot
Copy link

pytorch-bot bot commented Jan 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144944

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jan 16, 2025
@pytorch-bot
Copy link

pytorch-bot bot commented Jan 16, 2025

❌ 🤖 pytorchbot command failed:

Got EOF while in a quoted string```
Try `@pytorchbot --help` for more info.

@AnantGulati
Copy link
Contributor Author

@pytorchbot label "topic: not user facing"

@AnantGulati
Copy link
Contributor Author

@kwen2501 could you please review and help land this PR.
Thanks

Copy link
Collaborator

@kwen2501 kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. We are very close to land once we clean up a couple points in the comments

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new line

Also, how would this setter interact with the world_size function above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should interact with the above function such that it allows us to change world size for only one test case in the test file as shown here: https://github.com/pytorch/pytorch/pull/144944/files/25e3ab6c2c314457ef62b8e610f572e45e7724c7#r1919782578

Comment on lines 575 to 610
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this new test for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a new test, I have simply shifted it to the class TestCollectivesWithDistributedBackend, from it's original class as the only difference between the original class and TestCollectivesWithDistributedBackend is the changed world size.

This class was causing error while refactoring the code so I removed it and added the world size setter which will allow us to modify world size in common distributed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have added this line to change the world size only in this test case, this removes the need of creating a new class just to change one property without impacting the other classes

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind reverting the setting of world_size here and restore the original class, OR
use world_size = 4 instead of self.world_size ?

Today the change may be fine, but in case class-based Test fixture is used, any setting to self in one test method would impact the rest tests, because class-based Test fixture will not recreate a new test object per test.

@AnantGulati
Copy link
Contributor Author

@pytorchbot merge

@pytorch-bot
Copy link

pytorch-bot bot commented Jan 20, 2025

Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@AnantGulati
Copy link
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased AnantGulati_distributed_test_case_changes onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout AnantGulati_distributed_test_case_changes && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the AnantGulati_distributed_test_case_changes branch from 4feeb3f to a4e58d2 Compare January 20, 2025 10:10
Copy link
Collaborator

@kwen2501 kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am about to land this PR, but it seems one small change needs revert. Can you please help? Thank you!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind reverting the setting of world_size here and restore the original class, OR
use world_size = 4 instead of self.world_size ?

Today the change may be fine, but in case class-based Test fixture is used, any setting to self in one test method would impact the rest tests, because class-based Test fixture will not recreate a new test object per test.

@soulitzer soulitzer added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Mar 10, 2025
@AnantGulati AnantGulati force-pushed the AnantGulati_distributed_test_case_changes branch from 64d64cc to 2003796 Compare March 11, 2025 04:58
@AnantGulati
Copy link
Contributor Author

AnantGulati commented Mar 11, 2025

Do you mind reverting the setting of world_size here and restore the original class, OR
use world_size = 4 instead of self.world_size ?

Today the change may be fine, but in case class-based Test fixture is used, any setting to self in one test method would impact the rest tests, because class-based Test fixture will not recreate a new test object per test.

I have maintained the initial class structure to avoid this problem.

I have also changed the structure so that TestDistributedBackendCollectivesWithWorldSize4 inherits directly from DistributedTestBase.

This is cleaner as the set up steps which were initially present in TestDistributedBackendCollective have been abstracted in DistributedTestBase. This also solves the other errors which we were hitting due to inheriting classes which were intialized in instantiate_device_type_tests.

@kwen2501 I'd love to hear if you think any further adjustments are required. Thanks

@AnantGulati
Copy link
Contributor Author

@kwen2501 Could you please help with this PR

@AnantGulati
Copy link
Contributor Author

@kwen2501 all suggested changes have been added. Can we proceed with this PR?

@AnantGulati
Copy link
Contributor Author

@kwen2501 Could you please take a look at this PR. It has been pending for some time now
Thanks

@guangyey
Copy link
Collaborator

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased AnantGulati_distributed_test_case_changes onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout AnantGulati_distributed_test_case_changes && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the AnantGulati_distributed_test_case_changes branch from 2003796 to ee5e207 Compare March 21, 2025 06:31
@AnantGulati
Copy link
Contributor Author

@kwen2501 Could you please help with this review. All changes you suggested have been implemented

@pytorchmergebot
Copy link
Collaborator

Successfully rebased AnantGulati_distributed_test_case_changes onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout AnantGulati_distributed_test_case_changes && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the AnantGulati_distributed_test_case_changes branch from c08b452 to 89072b2 Compare May 1, 2025 05:31
@cyyever
Copy link
Collaborator

cyyever commented May 1, 2025

@pytorchmergebot merge -i

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 1, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged while ignoring the following 0 checks:

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label May 5, 2025
@guangyey
Copy link
Collaborator

guangyey commented May 5, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 5, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@guangyey
Copy link
Collaborator

guangyey commented May 9, 2025

@AnantGulati please resolve the conflicts.

@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label May 9, 2025
@guangyey
Copy link
Collaborator

guangyey commented May 9, 2025

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/144944/head returned non-zero exit code 1

Rebasing (1/8)
Auto-merging torch/testing/_internal/common_distributed.py
CONFLICT (content): Merge conflict in torch/testing/_internal/common_distributed.py
error: could not apply 59000a74fca... reverting class structure
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply 59000a74fca... reverting class structure

Raised by https://github.com/pytorch/pytorch/actions/runs/14922140404

@AnantGulati
Copy link
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/144944/head returned non-zero exit code 1

Rebasing (1/3)
Auto-merging test/distributed/test_functional_api.py
Auto-merging torch/testing/_internal/common_distributed.py
CONFLICT (content): Merge conflict in torch/testing/_internal/common_distributed.py
error: could not apply 59000a74fca... reverting class structure
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply 59000a74fca... reverting class structure

Raised by https://github.com/pytorch/pytorch/actions/runs/15550541252

@github-actions
Copy link
Contributor

github-actions bot commented Aug 9, 2025

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Aug 9, 2025
@github-actions github-actions bot closed this Sep 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

oncall: distributed Add this issue/PR to distributed oncall triage queue open source Stale topic: not user facing topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants