[c10] Move P2P access logic from ATen to c10 #174582

Closed
ngimel wants to merge 2 commits into gh/ngimel/1/base from gh/ngimel/1/head

@ngimel
Collaborator

@ngimel ngimel commented Feb 9, 2026

Stack from ghstack (oldest at bottom):

Refactor PeerToPeerAccess by moving the core implementation from
aten/src/ATen/cuda to c10/cuda. This makes P2P and fabric access
queries available at the c10 layer without requiring ATen dependencies.
The ATen layer now provides thin wrappers that ensure CUDA lazy
initialization before forwarding to c10. This separation allows
lower-level CUDA code to query P2P capabilities without pulling in
ATen context machinery.
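
As a rough illustration of that split, the sketch below models the two layers. It is written in Python purely for brevity; the PR itself is C++, and every name here (`LowLevelP2PCache`, `LazyInitWrapper`, the `probe` callback) is a hypothetical stand-in rather than the actual c10 or ATen API. The only behavior borrowed from this page is that the low-level cache refuses queries until it has been initialized.

```python
class LowLevelP2PCache:
    """Hypothetical stand-in for the low-level layer: no lazy-init logic."""

    def __init__(self):
        self.num_devices = -1  # -1 means "cache not initialized"
        self._results = {}

    def init(self, num_devices):
        """Explicit initialization; must run before any query."""
        self.num_devices = num_devices
        self._results.clear()

    def can_access_peer(self, dev, peer, probe):
        """Answer a P2P query, calling `probe` at most once per device pair."""
        if self.num_devices < 0:
            raise RuntimeError("p2p access cache not initialized")
        if dev == peer:
            return True
        key = (dev, peer)
        if key not in self._results:
            self._results[key] = bool(probe(dev, peer))
        return self._results[key]


class LazyInitWrapper:
    """Hypothetical stand-in for the thin high-level wrapper."""

    def __init__(self, low_level, num_devices, probe):
        self._low_level = low_level
        self._num_devices = num_devices
        self._probe = probe
        self._initialized = False

    def can_access_peer(self, dev, peer):
        # Ensure initialization has happened, then forward unchanged.
        if not self._initialized:
            self._low_level.init(self._num_devices)
            self._initialized = True
        return self._low_level.can_access_peer(dev, peer, self._probe)
```

With this shape, code that already manages initialization can call the low-level object directly, while other callers go through the wrapper and never see the "not initialized" error.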

Original diff by @minsii #173571

Differential Revision: D92675476

@pytorch-bot

pytorch-bot bot commented Feb 9, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/174582

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 72 Cancelled Jobs, 3 Unrelated Failures

As of commit 42cf315 with merge base a4c4cc8 (image):

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ngimel added a commit that referenced this pull request Feb 9, 2026

Differential Revision: [D92675476](https://our.internmc.facebook.com/intern/diff/D92675476/)

ghstack-source-id: 339391993
Pull Request resolved: #174582
@pytorch-bot

pytorch-bot bot commented Feb 9, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

ngimel added a commit that referenced this pull request Feb 9, 2026
Pull Request resolved: #174582

ghstack-source-id: 339394212
@exported-using-ghexport

Differential Revision: [D92675476](https://our.internmc.facebook.com/intern/diff/D92675476/)

    num_devices_ >= 0,
    "p2p access cache not initialized. "
    "Ensure c10::cuda::detail::init_p2p_access_cache() is called first.");
    TORCH_CHECK(

Collaborator

Some of these should be TORCH_CHECK_VALUE, but that should be another PR.
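
For context on the distinction being raised: in PyTorch, a failed `TORCH_CHECK` surfaces in Python as `RuntimeError`, while a failed `TORCH_CHECK_VALUE` surfaces as `ValueError`. Which checks the reviewer had in mind is not stated on this page, so the split below is an assumption, and `check_p2p_query` is a hypothetical Python stand-in, not code from the PR.

```python
def check_p2p_query(num_devices, dev, peer):
    """Validate a P2P query, separating state errors from argument errors."""
    if num_devices < 0:
        # State problem: the arguments may be fine, but the cache is not ready.
        # This is the kind of failure a plain TORCH_CHECK would report.
        raise RuntimeError("p2p access cache not initialized")
    for name, idx in (("dev", dev), ("peer", peer)):
        if not 0 <= idx < num_devices:
            # Argument problem: the caller passed a bad value.
            # This is the kind of failure TORCH_CHECK_VALUE is meant for.
            raise ValueError(f"{name}={idx} is out of range [0, {num_devices})")
```

Callers can then catch `ValueError` for bad inputs without also swallowing unrelated runtime failures.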

pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Feb 9, 2026
@ngimel
Collaborator Author

ngimel commented Feb 9, 2026

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge failed

Reason: Not merging any PRs at the moment because there is a merge blocking https://github.com/pytorch/pytorch/labels/ci:%20sev issue open at:
#174600

Details for Dev Infra team Raised by workflow job

eqy added the topic: not user facing (topic category) label Feb 9, 2026
@eqy
Collaborator

eqy commented Feb 9, 2026

@pytorchmergebot merge

@pytorchmergebot
Collaborator

Merge failed

Reason: Not merging any PRs at the moment because there is a merge blocking https://github.com/pytorch/pytorch/labels/ci:%20sev issue open at:
#174600

@ngimel
Collaborator Author

ngimel commented Feb 9, 2026

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/rexporting the PR!

@facebook-github-bot
Contributor

@pytorchbot merge -i

(Initiating merge automatically since Phabricator Diff has merged, merging with -i because oss signals were bypassed internally)

@pytorchmergebot
Collaborator

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/rexporting the PR!

@seemethere
Member

@pytorchbot merge -f 'merged internally'

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorch-auto-revert

@pytorchbot revert -m "Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable" -c autorevert

This PR is attributed to have caused regression in:

Please investigate and fix the issues.
@claude Can you please read this revert comment, follow the links, and read the errors, then give a brief diagnosis of the cause of the error? If you judge the error to be a legitimate reason for a revert, please provide brief guidance on how the author could fix it.

@claude

claude bot commented Feb 10, 2026

Claude finished @pytorch-auto-revert[bot]'s task in 42s


Analyzing Revert Request

  • Read the CI failure logs to understand the error
  • Diagnose the cause of the failure
  • Provide guidance on how to fix the issue

Diagnosis

The revert is legitimate. The CI failure is a simple lint error caused by trailing whitespace.

File: torch/utils/hipify/cuda_to_hip_mappings.py
Line: 3416
Issue: The line added by this PR has 4 trailing spaces after the comma:

    ("c10/cuda/PeerToPeerAccess.h", "c10/hip/PeerToPeerAccess.h"),    ← trailing spaces here

The linter (SPACES rule) correctly flagged this as an error.

How to Fix

  1. Remove the trailing spaces from line 3416 in torch/utils/hipify/cuda_to_hip_mappings.py
  2. The line should end with just the comma, no trailing whitespace:
    ("c10/cuda/PeerToPeerAccess.h", "c10/hip/PeerToPeerAccess.h"),
  3. Run lintrunner -a locally to auto-fix this and verify no other lint issues before re-submitting

This is a trivial fix - just remove the trailing spaces and the PR should pass CI.
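
To make the failure mode concrete, here is a small Python sketch that detects and strips trailing spaces or tabs. It is not the implementation of lintrunner's SPACES rule; the function names are hypothetical, and it assumes LF line endings.

```python
def find_trailing_whitespace(text):
    """Return 1-based numbers of lines that end in spaces or tabs."""
    return [
        lineno
        for lineno, line in enumerate(text.split("\n"), start=1)
        if line != line.rstrip(" \t")
    ]


def strip_trailing_whitespace(text):
    """Drop trailing spaces and tabs from every line of LF-terminated text."""
    return "\n".join(line.rstrip(" \t") for line in text.split("\n"))


if __name__ == "__main__":
    # The mapping entry quoted above, with four trailing spaces after the comma.
    sample = (
        "MAPPINGS = [\n"
        '    ("c10/cuda/PeerToPeerAccess.h", "c10/hip/PeerToPeerAccess.h"),    \n'
        "]\n"
    )
    print(find_trailing_whitespace(sample))
    print(find_trailing_whitespace(strip_trailing_whitespace(sample)))
```

In practice `lintrunner -a`, as recommended above, remains the way to apply the project's own rules; the sketch only shows what kind of difference the linter is reacting to.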


@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.

pytorchmergebot added a commit that referenced this pull request Feb 10, 2026
This reverts commit c1bc0e9.

Reverted #174582 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](#174582 (comment)))
@pytorchmergebot
Collaborator

@ngimel your PR has been successfully reverted.

pytorchmergebot added the Reverted and ci-no-td (Do not run TD on this PR) labels Feb 10, 2026
@facebook-github-bot
Contributor

@pytorchbot merge -i

(Initiating merge automatically since Phabricator Diff has merged, merging with -i because oss signals were bypassed internally)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 3 checks: Lint / lintrunner-noclang-all / linux-job, trunk / macos-py3-arm64 / build, Meta Internal-Only Changes Check

@pytorchmergebot
Collaborator

radeksm pushed a commit to radeksm/pytorch that referenced this pull request Feb 20, 2026

Original diff by @minsii pytorch#173571

Differential Revision: [D92675476](https://our.internmc.facebook.com/intern/diff/D92675476/)
Pull Request resolved: pytorch#174582
Approved by: https://github.com/Skylion007
radeksm pushed a commit to radeksm/pytorch that referenced this pull request Feb 20, 2026
This reverts commit c1bc0e9.

Reverted pytorch#174582 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#174582 (comment)))
@github-actions github-actions bot deleted the gh/ngimel/1/head branch March 13, 2026 02:22

Labels

ci-no-td (Do not run TD on this PR), ciflow/trunk (Trigger trunk jobs on your pull request), fb-exported, Merged, meta-exported, Reverted, topic: not user facing (topic category)

6 participants