[SymmMem] Add helpful docstrings for all NVSHMEM APIs #159756

Closed
codingwithsurya wants to merge 15 commits into gh/codingwithsurya/20/base from gh/codingwithsurya/20/head

Conversation

@pytorch-bot

pytorch-bot bot commented Aug 4, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159756

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job, 1 Unrelated Failure

As of commit ca54464 with merge base 3daef4d:

CANCELLED JOB - The following job was cancelled. Please retry:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

  • pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, linux.12xlarge, unstable) (gh) (#158876)
    /var/lib/jenkins/workspace/xla/torch_xla/csrc/runtime/BUILD:476:14: Compiling torch_xla/csrc/runtime/xla_util_test.cpp failed: (Exit 1): gcc failed: error executing CppCompile command (from target //torch_xla/csrc/runtime:xla_util_test) /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 229 arguments skipped)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Fed Claude Code the NVSHMEM documentation and asked it to generate helpful docstrings; verified for correctness.

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta

[ghstack-poisoned]
codingwithsurya added a commit that referenced this pull request Aug 4, 2025
ghstack-source-id: b6609e8
Pull Request resolved: #159756
codingwithsurya added a commit that referenced this pull request Aug 4, 2025
ghstack-source-id: 49b2068
Pull Request resolved: #159756

@mandroid6 mandroid6 left a comment

lint runner is failing?

@mandroid6

Approving land assuming lint issues are resolved!

@codingwithsurya
Contributor Author

codingwithsurya commented Aug 4, 2025

lint runner is failing?

yes, updated!

@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #159788



pytorchmergebot pushed a commit that referenced this pull request Aug 8, 2025
…on kernels (#159788)

This PR introduces a small `@triton.jit` wrapper function over our core NVSHMEM extern functions for users to send tensors as inputs to their NVSHMEM Triton kernels (rather than pointers).

The goal is to abstract away tedious details from the developer, like manual byte-size calculations and handling of raw `int64` pointers. This lets developers work directly with typed Triton tensors and element counts, which is also useful if you want to, for instance, do some local math on the data.
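As a rough host-side illustration (names here are invented for the sketch, not the actual PR code), this is the kind of element-count-to-byte-count bookkeeping such a wrapper hides; the real wrapper is a `@triton.jit` function calling the pointer-based NVSHMEM extern:

```python
# Hypothetical sketch: translate typed-tensor arguments (element count +
# element size) into the raw byte count a pointer-based putmem extern
# expects, then forward the call. Names are illustrative only.

def put_tensor(dst_ptr, src_ptr, numel, element_size, peer, putmem):
    """Hide the byte-size arithmetic from the caller."""
    nbytes = numel * element_size
    putmem(dst_ptr, src_ptr, nbytes, peer)
    return nbytes

calls = []
# stand-in for the pointer-based extern
fake_putmem = lambda dst, src, nbytes, peer: calls.append((dst, src, nbytes, peer))

# 1024 float32 elements -> 4096 bytes
nbytes = put_tensor(0x2000, 0x1000, numel=1024, element_size=4, peer=1,
                    putmem=fake_putmem)
print(nbytes)  # 4096
```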

-----

**TODO:**
This is almost complete. One pending item is a tensor-aware implementation of `nvshmem.putmem_signal_block` and `nvshmem.signal_wait_until`.

From my investigation, the root cause is that this specific tensor API uses local addresses instead of remote addresses for the peer:

```
Pointer-Based Version:

  Rank 0 → Rank 1:
    Local buffer:   0x430300a00  (src)
    Remote buffer:  0x2430300c00 (dst) ← Rank 1's memory
    Remote signal:  0x2430301600 (sig) ← Rank 1's signal

  Rank 1 (waiting):
    Local signal:   0x430301600 (waits here)

Tensor-Based Version:

  Rank 0 → Rank 1:
    Local buffer:   0x430300a00  (src)
    Local buffer:   0x430300c00  (dst) ← this is wrong
    Local signal:   0x430300e00  (sig) ← this is wrong

  Rank 1 (waiting):
    Local signal:   0x430300e00 (waits here)

```

Next steps: we need a mechanism to resolve a local tensor to a remote PE address, equivalent to a `handle.buffer_ptrs[peer]` lookup.
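The needed translation can be sketched like this (illustrative only: it assumes every rank's symmetric buffer has the same layout, so a local address can be rebased onto the peer's buffer by preserving its offset; the rank-1 base below is an assumed value consistent with the trace above):

```python
# Hypothetical local -> remote address translation, analogous to a
# handle.buffer_ptrs[peer] lookup: keep the offset into the local
# symmetric buffer, swap in the peer's base address.

def remote_address(local_ptr, my_rank, peer, buffer_ptrs):
    """Rebase a local symmetric-buffer address onto `peer`'s buffer."""
    offset = local_ptr - buffer_ptrs[my_rank]
    return buffer_ptrs[peer] + offset

# rank 0 base from the trace; rank 1 base assumed from the remote addresses
buffer_ptrs = [0x430300A00, 0x2430300A00]
dst = remote_address(0x430300C00, my_rank=0, peer=1, buffer_ptrs=buffer_ptrs)
print(hex(dst))  # 0x2430300c00 -- the "Remote buffer" the pointer-based run used
```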

Pull Request resolved: #159788
Approved by: https://github.com/mandroid6, https://github.com/ngimel
ghstack dependencies: #158515, #158718, #159136, #159215, #159701, #159734, #159755, #159756
hinriksnaer pushed a commit to hinriksnaer/pytorch that referenced this pull request Aug 8, 2025

Pull Request resolved: pytorch#159756
Approved by: https://github.com/mandroid6, https://github.com/ngimel
ghstack dependencies: pytorch#158515, pytorch#158718, pytorch#159136, pytorch#159215, pytorch#159701, pytorch#159734, pytorch#159755
hinriksnaer pushed a commit to hinriksnaer/pytorch that referenced this pull request Aug 8, 2025
@YUNQIUGUO
Contributor

@codingwithsurya hey! Thank you again for the amazing work of integrating the nvshmem_triton apis! (hopefully I can still reach out to you through gh here!)

one qq - I saw a couple of test plan cmds, and the CI setup seems to only work for OSS PyTorch. just wondering, is the nvshmem_triton lib pluggable and usable in fbcode yet?

@codingwithsurya
Contributor Author

@codingwithsurya hey! Thank you again for the amazing work of integrating the nvshmem_triton apis! (hopefully I can still reach out to you through gh here!)

one qq - I saw a couple test plans cmds and CI setup seems only work for OSS pytorch. just wondering is nvshmem_triton lib pluggable and usable in FBcode yet?

hey rachel! yep, you can always reach me here on GH! quick clarifier: which test plans / cmds / CI setup are you referring to? for context, i mainly worked on the OSS PyTorch side, so anything you’re seeing should be OSS-focused and shouldn’t be wired up for fbcode.

@YUNQIUGUO
Copy link
Contributor

YUNQIUGUO commented Aug 18, 2025


which test plans / cmds / CI setup are you referring to?

Thanks for the instant reply!!
for example

# python test/distributed/test_nvshmem_triton.py

this test_nvshmem_triton file isn't working with buck build yet (i.e. it's not integrated into a corresponding BUCK test target). the command here I verified works without issue in the OSS pytorch repo, so no blockers for now!

just wondering because, possibly for our overlapping-comp-comm kernel going forward, we'd like to keep a copy of the kernels inside fbsource/fbcode too. so we would need to buckify the nvshmem_triton lib so we can further integrate NVSHMEM-based distributed kernels apart from the current symm_mem + triton kernels.

@codingwithsurya
Contributor Author

codingwithsurya commented Aug 18, 2025

Thanks for the instant reply!! for example e.g.

# python test/distributed/test_nvshmem_triton.py

this test_nvshmem_triton UT file is not working for buck build (i.e. not integrated in corresponding BUCK test target) yet iiuc. the command here I verified works without issue in OSS pytorch repo so no blockers for now!

just wondering because possibly for our overlapping-comp-comm kernel going forward, we'd like to keep a copy version of kernels inside fbsource/fbcode too. so would need to buckify the nvshmem_triton lib so we can further integrate NVSHMEM-based distributed kernels apart from the current symm_mem + triton kernels.

ahh yeah makes sense, it’s only set up for OSS right now. you’ll probably need to add the BUCK targets and buckify the nvshmem_triton lib to use it in fbcode.

markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
@github-actions github-actions bot deleted the gh/codingwithsurya/20/head branch September 18, 2025 02:07

Labels

Merged · oncall: distributed · release notes: distributed (symm_mem)
