[SymmMem] Add helpful docstrings for all NVSHMEM APIs #159756

Closed
codingwithsurya wants to merge 15 commits into gh/codingwithsurya/20/base from gh/codingwithsurya/20/head

Conversation

@pytorch-bot

pytorch-bot bot commented Aug 4, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159756

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job, 1 Unrelated Failure

As of commit ca54464 with merge base 3daef4d:

CANCELLED JOB - The following job was cancelled. Please retry:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

  • pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, linux.12xlarge, unstable) (gh) (#158876)
    /var/lib/jenkins/workspace/xla/torch_xla/csrc/runtime/BUILD:476:14: Compiling torch_xla/csrc/runtime/xla_util_test.cpp failed: (Exit 1): gcc failed: error executing CppCompile command (from target //torch_xla/csrc/runtime:xla_util_test) /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 229 arguments skipped)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Fed Claude Code the NVSHMEM documentation and asked it to generate helpful docstrings; verified for correctness.

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta

[ghstack-poisoned]
codingwithsurya added a commit that referenced this pull request Aug 4, 2025
ghstack-source-id: b6609e8
Pull Request resolved: #159756
codingwithsurya added a commit that referenced this pull request Aug 4, 2025
ghstack-source-id: 49b2068
Pull Request resolved: #159756

@mandroid6 mandroid6 left a comment

lint runner is failing?

@mandroid6

Approving land assuming lint issues are resolved!

@codingwithsurya
Contributor Author

codingwithsurya commented Aug 4, 2025

lint runner is failing?

yes, updated!

@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #159788



pytorchmergebot pushed a commit that referenced this pull request Aug 8, 2025
…on kernels (#159788)

This PR introduces a small `@triton.jit` wrapper function over our core NVSHMEM extern functions for users to send tensors as inputs to their NVSHMEM Triton kernels (rather than pointers).

The goal is to abstract away tedious details from the developer, like manual byte-size calculations and handling of raw `int64` pointers. This lets developers work directly with typed Triton tensors and element counts, which is also useful if you want to, for instance, do some local math on the data.
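As a rough host-side illustration (names here are invented for the sketch, not the actual PR code), this is the kind of element-count-to-byte-count bookkeeping such a wrapper hides; the real wrapper is a `@triton.jit` function calling the pointer-based NVSHMEM extern:

```python
# Hypothetical sketch: translate typed-tensor arguments (element count +
# element size) into the raw byte count a pointer-based putmem extern
# expects, then forward the call. Names are illustrative only.

def put_tensor(dst_ptr, src_ptr, numel, element_size, peer, putmem):
    """Hide the byte-size arithmetic from the caller."""
    nbytes = numel * element_size
    putmem(dst_ptr, src_ptr, nbytes, peer)
    return nbytes

calls = []
# stand-in for the pointer-based extern
fake_putmem = lambda dst, src, nbytes, peer: calls.append((dst, src, nbytes, peer))

# 1024 float32 elements -> 4096 bytes
nbytes = put_tensor(0x2000, 0x1000, numel=1024, element_size=4, peer=1,
                    putmem=fake_putmem)
print(nbytes)  # 4096
```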

-----

**TODO:**
This is almost complete. One pending item is a tensor-aware implementation of `nvshmem.putmem_signal_block` and `nvshmem.signal_wait_until`.

From my investigation, the root cause is that this specific tensor API uses local addresses instead of remote addresses for the peer:

```
Pointer-Based Version:

  Rank 0 → Rank 1:
    Local buffer:   0x430300a00  (src)
    Remote buffer:  0x2430300c00 (dst) ← Rank 1's memory
    Remote signal:  0x2430301600 (sig) ← Rank 1's signal

  Rank 1 (waiting):
    Local signal:   0x430301600 (waits here)

Tensor-Based Version:

  Rank 0 → Rank 1:
    Local buffer:   0x430300a00  (src)
    Local buffer:   0x430300c00  (dst) ← this is wrong
    Local signal:   0x430300e00  (sig) ← this is wrong

  Rank 1 (waiting):
    Local signal:   0x430300e00 (waits here)

```

Next steps: we need a mechanism to resolve a local tensor to a remote PE address, equivalent to a `handle.buffer_ptrs[peer]` lookup.
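The needed translation can be sketched like this (illustrative only: it assumes every rank's symmetric buffer has the same layout, so a local address can be rebased onto the peer's buffer by preserving its offset; the rank-1 base below is an assumed value consistent with the trace above):

```python
# Hypothetical local -> remote address translation, analogous to a
# handle.buffer_ptrs[peer] lookup: keep the offset into the local
# symmetric buffer, swap in the peer's base address.

def remote_address(local_ptr, my_rank, peer, buffer_ptrs):
    """Rebase a local symmetric-buffer address onto `peer`'s buffer."""
    offset = local_ptr - buffer_ptrs[my_rank]
    return buffer_ptrs[peer] + offset

# rank 0 base from the trace; rank 1 base assumed from the remote addresses
buffer_ptrs = [0x430300A00, 0x2430300A00]
dst = remote_address(0x430300C00, my_rank=0, peer=1, buffer_ptrs=buffer_ptrs)
print(hex(dst))  # 0x2430300c00 -- the "Remote buffer" the pointer-based run used
```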

Pull Request resolved: #159788
Approved by: https://github.com/mandroid6, https://github.com/ngimel
ghstack dependencies: #158515, #158718, #159136, #159215, #159701, #159734, #159755, #159756
hinriksnaer pushed a commit to hinriksnaer/pytorch that referenced this pull request Aug 8, 2025

Pull Request resolved: pytorch#159756
Approved by: https://github.com/mandroid6, https://github.com/ngimel
ghstack dependencies: pytorch#158515, pytorch#158718, pytorch#159136, pytorch#159215, pytorch#159701, pytorch#159734, pytorch#159755
hinriksnaer pushed a commit to hinriksnaer/pytorch that referenced this pull request Aug 8, 2025
@YUNQIUGUO
Contributor

@codingwithsurya hey! Thank you again for the amazing work of integrating the nvshmem_triton apis! (hopefully I can still reach out to you through gh here!)

one qq - I saw a couple of test plan cmds, and the CI setup seems to only work for OSS PyTorch. just wondering, is the nvshmem_triton lib pluggable and usable in fbcode yet?

@codingwithsurya
Contributor Author

@codingwithsurya hey! Thank you again for the amazing work of integrating the nvshmem_triton apis! (hopefully I can still reach out to you through gh here!)

one qq - I saw a couple test plans cmds and CI setup seems only work for OSS pytorch. just wondering is nvshmem_triton lib pluggable and usable in FBcode yet?

hey rachel! yep, you can always reach me here on GH! quick clarifier: which test plans / cmds / CI setup are you referring to? for context, i mainly worked on the OSS PyTorch side, so anything you’re seeing should be OSS-focused and shouldn’t be wired up for fbcode.

@YUNQIUGUO
Copy link
Contributor

YUNQIUGUO commented Aug 18, 2025


which test plans / cmds / CI setup are you referring to?

Thanks for the instant reply!!
for example

# python test/distributed/test_nvshmem_triton.py

this test_nvshmem_triton file isn't working with buck build yet (i.e. it's not integrated into a corresponding BUCK test target). the command here I verified works without issue in the OSS pytorch repo, so no blockers for now!

just wondering because, possibly for our overlapping-comp-comm kernel going forward, we'd like to keep a copy of the kernels inside fbsource/fbcode too. so we would need to buckify the nvshmem_triton lib so we can further integrate NVSHMEM-based distributed kernels apart from the current symm_mem + triton kernels.

@codingwithsurya
Contributor Author

codingwithsurya commented Aug 18, 2025

Thanks for the instant reply!! for example e.g.

# python test/distributed/test_nvshmem_triton.py

this test_nvshmem_triton UT file is not working for buck build (i.e. not integrated in corresponding BUCK test target) yet iiuc. the command here I verified works without issue in OSS pytorch repo so no blockers for now!

just wondering because possibly for our overlapping-comp-comm kernel going forward, we'd like to keep a copy version of kernels inside fbsource/fbcode too. so would need to buckify the nvshmem_triton lib so we can further integrate NVSHMEM-based distributed kernels apart from the current symm_mem + triton kernels.

ahh yeah makes sense, it’s only set up for OSS right now. you’ll probably need to add the BUCK targets and buckify the nvshmem_triton lib to use it in fbcode.

markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
@github-actions github-actions bot deleted the gh/codingwithsurya/20/head branch September 18, 2025 02:07

Labels

Merged · oncall: distributed · release notes: distributed (symm_mem)
