[SymmMem] Increase signal pad size for NVL72 #162026

Closed
kwen2501 wants to merge 1 commit into gh/kwen2501/226/base from gh/kwen2501/226/head

kwen2501 (Collaborator) commented Sep 2, 2025

Stack from ghstack (oldest at bottom):

Increase the signal pad size so that signal calls do not step on each other's toes.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim

[ghstack-poisoned]
pytorch-bot bot commented Sep 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162026

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b6bcad5 with merge base 524b78d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kwen2501 added a commit that referenced this pull request Sep 2, 2025
ghstack-source-id: c4f0bd7
Pull-Request-resolved: #162026
pytorch-bot bot added the ciflow/h100-symm-mem, oncall: distributed, and release notes: distributed (c10d) labels Sep 2, 2025
kwen2501 requested a review from ngimel September 2, 2025 23:56
// Covers NVL72
constexpr int max_cuda_p2p_domain_size = 72;
// Maximum number of channels
constexpr int symm_max_nblocks = 32;
Collaborator

Do we ever increase the number of blocks to 32?

Collaborator Author

We may need to do so soon for our ops on Blackwell (NV18) because of its high bandwidth.
I will put up a PR for that later.
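
To make the sizing concern in this thread concrete, here is a minimal sketch of how the two constants from the diff could bound a signal pad. The layout is an assumption for illustration only: it supposes one 32-bit signal slot per (channel, peer) pair, which this page does not confirm. The helpers `signal_pad_bytes` and `signal_slot_offset` are hypothetical names and not part of PyTorch.

```cpp
#include <cstddef>
#include <cstdint>

// Constants quoted from the diff under review.
constexpr int max_cuda_p2p_domain_size = 72;  // covers NVL72
constexpr int symm_max_nblocks = 32;          // maximum number of channels

// ASSUMPTION (not confirmed by this PR page): every (channel, peer) pair owns
// one 32-bit signal slot, so the pad must hold nblocks * domain_size slots.
constexpr std::size_t signal_pad_bytes(int nblocks, int domain_size) {
  return static_cast<std::size_t>(nblocks) *
         static_cast<std::size_t>(domain_size) * sizeof(std::uint32_t);
}

// Hypothetical helper: byte offset of the slot used by `channel` to signal
// `peer`, with slots laid out channel-major.
constexpr std::size_t signal_slot_offset(int channel, int peer,
                                         int domain_size) {
  return (static_cast<std::size_t>(channel) *
              static_cast<std::size_t>(domain_size) +
          static_cast<std::size_t>(peer)) *
         sizeof(std::uint32_t);
}

// Pad needed for a 72-GPU domain versus an 8-GPU domain under this layout.
constexpr std::size_t nvl72_pad_bytes =
    signal_pad_bytes(symm_max_nblocks, max_cuda_p2p_domain_size);
constexpr std::size_t eight_gpu_pad_bytes =
    signal_pad_bytes(symm_max_nblocks, 8);

// The last slot (channel 31, peer 71) must end exactly at the pad boundary.
static_assert(signal_slot_offset(symm_max_nblocks - 1,
                                 max_cuda_p2p_domain_size - 1,
                                 max_cuda_p2p_domain_size) +
                      sizeof(std::uint32_t) ==
                  nvl72_pad_bytes,
              "last slot must fit inside the pad");
```

Under this assumed layout, a pad sized for 8 peers (1024 bytes) would be too small for a 72-GPU domain, where the last slot starts at byte offset 9212. Slots falling outside the pad or aliasing one another is one way signal calls could collide.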

kwen2501 (Collaborator, Author) commented Sep 4, 2025

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label Sep 4, 2025
pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
so that the signal calls do not step on each other's foot.

Pull Request resolved: pytorch#162026
Approved by: https://github.com/ngimel
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
so that the signal calls do not step on each other's foot.

Pull Request resolved: pytorch#162026
Approved by: https://github.com/ngimel
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
so that the signal calls do not step on each other's foot.

Pull Request resolved: pytorch#162026
Approved by: https://github.com/ngimel
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
so that the signal calls do not step on each other's foot.

Pull Request resolved: pytorch#162026
Approved by: https://github.com/ngimel
@github-actions github-actions bot deleted the gh/kwen2501/226/head branch October 5, 2025 02:16
Labels

ciflow/h100-symm-mem, ciflow/trunk, Merged, oncall: distributed, release notes: distributed (c10d)
