[SymmMem] Standardize NVSHMEM Triton wrappers on byte-based APIs + improve code clarity by codingwithsurya · Pull Request #159136 · pytorch/pytorch

codingwithsurya · 2025-07-25T07:33:11Z

Quick refactor for consistency and clarity.

We now standardize all NVSHMEM data-moving collectives (put, get, alltoall, broadcast) to use their byte-based *_mem_block variants. This makes the API behavior more predictable and avoids mixing paradigms.
Previously, some functions operated on element counts (nelems), while others expected byte sizes but still used nelems as the param name. That inconsistency was easy to miss and could lead to bugs, especially for devs not familiar with the NVSHMEM internals.

To clean this up:
• All byte-based APIs now use nbytes or nbytes_per_pe to make the units explicit.
• Typed APIs consistently use nelems for element counts.
• Docstrings were added or updated to clarify expected units.

Also did some code cleanup — removed unused functions, fixed typos in comments, and did some general housekeeping.

This should make the API more intuitive and reduce friction for developers.

Stack from ghstack (oldest at bottom):

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta

…prove code clarity [ghstack-poisoned]

pytorch-bot · 2025-07-25T07:33:14Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159136

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job, 1 Unrelated Failure

As of commit a2086b8 with merge base 3daef4d ():

CANCELLED JOB - The following job was cancelled. Please retry:

Limited CI for symmetric memory tests on H100 / linux-jammy-cuda12.8-py3.10-gcc11-sm90-symm / test (h100-symm-mem, 1, 1, linux.aws.h100.4) (gh)

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, linux.12xlarge, unstable) (gh) (#158876)
/var/lib/jenkins/workspace/xla/torch_xla/csrc/runtime/BUILD:476:14: Compiling torch_xla/csrc/runtime/xla_util_test.cpp failed: (Exit 1): gcc failed: error executing CppCompile command (from target //torch_xla/csrc/runtime:xla_util_test) /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 229 arguments skipped)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

…prove code clarity ghstack-source-id: 2adbce6 Pull Request resolved: #159136

…d APIs + improve code clarity" cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta [ghstack-poisoned]

…prove code clarity ghstack-source-id: f1cf7e8 Pull Request resolved: #159136

…e-based APIs + improve code clarity" Quick refactor for consistency and clarity. 1. We now standardize all NVSHMEM data-moving collectives (put, get, alltoall, broadcast) to use their byte-based *_mem_block variants. This makes the API behavior more predictable and avoids mixing paradigms. 2. Previously, some functions operated on element counts (nelems), while others expected byte sizes but still used `nelems` as the param name. That inconsistency was easy to miss and could lead to bugs, especially for devs not familiar with the NVSHMEM internals. To clean this up: • All byte-based APIs now use nbytes or nbytes_per_pe to make the units explicit. • Typed APIs consistently use nelems for element counts. • Docstrings were added or updated to clarify expected units. Also did some code cleanup — removed unused functions, fixed typos in comments, and did some general housekeeping. This should make the API more intuitive and reduce friction for developers. cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta [ghstack-poisoned]

…prove code clarity ghstack-source-id: cb6c6dc Pull Request resolved: #159136

…e-based APIs + improve code clarity" Quick refactor for consistency and clarity. 1. We now standardize all NVSHMEM data-moving collectives (put, get, alltoall, broadcast) to use their byte-based *_mem_block variants. This makes the API behavior more predictable and avoids mixing paradigms. 2. Previously, some functions operated on element counts (nelems), while others expected byte sizes but still used `nelems` as the param name. That inconsistency was easy to miss and could lead to bugs, especially for devs not familiar with the NVSHMEM internals. To clean this up: • All byte-based APIs now use nbytes or nbytes_per_pe to make the units explicit. • Typed APIs consistently use nelems for element counts. • Docstrings were added or updated to clarify expected units. Also did some code cleanup — removed unused functions, fixed typos in comments, and did some general housekeeping. This should make the API more intuitive and reduce friction for developers. cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta [ghstack-poisoned]

…prove code clarity ghstack-source-id: a17309f Pull Request resolved: #159136

…d APIs + improve code clarity" Quick refactor for consistency and clarity. 1. We now standardize all NVSHMEM data-moving collectives (put, get, alltoall, broadcast) to use their byte-based *_mem_block variants. This makes the API behavior more predictable and avoids mixing paradigms. 2. Previously, some functions operated on element counts (nelems), while others expected byte sizes but still used `nelems` as the param name. That inconsistency was easy to miss and could lead to bugs, especially for devs not familiar with the NVSHMEM internals. To clean this up: • All byte-based APIs now use nbytes or nbytes_per_pe to make the units explicit. • Typed APIs consistently use nelems for element counts. • Docstrings were added or updated to clarify expected units. Also did some code cleanup — removed unused functions, fixed typos in comments, and did some general housekeeping. This should make the API more intuitive and reduce friction for developers. cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta [ghstack-poisoned]

pytorchmergebot · 2025-08-07T21:06:06Z

Starting merge as part of PR stack under #159788

pytorchmergebot · 2025-08-07T23:47:17Z

Starting merge as part of PR stack under #159788

pytorchmergebot · 2025-08-08T05:19:16Z

Starting merge as part of PR stack under #159788

When playing around with it, I noticed some flakiness in this test across sessions. After debugging, turns out the heavy sync primitives that I was calling (like `nvshmem_quiet()` or `nvshmem_fence()`) from inside Triton kernels was causing deadlocks. The original test tried to guarantee ordering: `put(data) -> fence/quiet -> put(flag)`. But the GPU thread got stuck in `quiet()` waiting for network confirmation while holding the SM, creating a deadlock. The fix was realizing `wait_until` already provides all the sync you need. Just do: - PE A: `nvshmem_wait_until(&ivar, ...)` - PE B: `nvshmem_put(&ivar_on_PE_A, ...)` Pull Request resolved: #159215 Approved by: https://github.com/mandroid6, https://github.com/ngimel ghstack dependencies: #158515, #158718, #159136

…(make device library discoverable + fix peer calculation bug) (#159701) This PR introduces support for Triton 3.4 and resolves several CI and test-related issues. **Triton 3.4 Compatibility** - The JIT post-compile hook has been updated from the legacy JITFunction.compiled_hook to the new API path at triton.knobs.runtime.jit_post_compile_hook. - The internal parameter for kernel semantics in extern function definitions has been updated from _semantic to _builder to align with API changes. **Fix CI Errors** - The new logic inspects the RPATH of libtorch_nvshmem.so to find the NVSHMEM device library, preventing CI tests from being skipped. - Added a decorator to run NVSHMEM tests only on H100s (compatible hardware) **Peer Rank Calculation Fix** - The peer calculation in test_nvshmem_triton.py was changed from peer = (world_size - 1) - rank to peer = 1 - rank. Reasoning: The previous logic was only valid for a 2-rank setup. In the 8-rank CI environment, it incorrectly mapped peers (e.g., rank 0 to 7), breaking tests that assume a 0↔1 communication pattern. This was reproduced and validated on an 8-rank dev setup. Pull Request resolved: #159701 Approved by: https://github.com/ngimel ghstack dependencies: #158515, #158718, #159136, #159215

…m in their name (#159734) Previously, a global post-compile hook initialized the NVSHMEM module for all Triton kernels, which was inefficient. This change conditionally initializes `_nvshmemx_cumodule_init(kernel.module)` only for Triton kernels containing "nvshmem" in their name. Also updated the names for all of our nvshmem kernels to align with this. Pull Request resolved: #159734 Approved by: https://github.com/ngimel ghstack dependencies: #158515, #158718, #159136, #159215, #159701

…tomatic dtype‐based dispatch (#159755) This change introduces a single, generic Triton‐extern wrapper for NVSHMEM team‐based reductions. We now expose one function, `nvshmem.reduce(team, dest, source, nreduce, operation, dtype_id)`, that covers all supported ops (sum, max, min, prod) and dtypes (int8…int64, uint8…uint64, float16, bfloat16, float32, float64). It accepts real dtype objects (torch.dtype or tl.dtype) directly in the Triton kernel launch. Internally, we normalize dtype_id (handling tl.dtype, torch.dtype, str, or constexpr) into the canonical NVSHMEM typename and assemble the proper function name, e.g. nvshmem_float_sum_reduce or nvshmem_bfloat16_prod_reduce Pull Request resolved: #159755 Approved by: https://github.com/ngimel ghstack dependencies: #158515, #158718, #159136, #159215, #159701, #159734

Fed Claude Code NVSHMEM Documentation and asked it to generate helpful docstrings. Verified for correctness. Pull Request resolved: #159756 Approved by: https://github.com/mandroid6, https://github.com/ngimel ghstack dependencies: #158515, #158718, #159136, #159215, #159701, #159734, #159755

…on kernels (#159788) This PR introduces a small `@triton.jit` wrapper function over our core NVSHMEM extern functions for users to send tensors as inputs to their NVSHMEM Triton kernels (rather than pointers). The goal is to abstract away tedious details from the developer, like manual byte-size calculations and handling of raw `int64` pointers. This lets developers work directly with typed Triton tensors and element counts, which will also be useful if you want to do for instance some local math on the data. ----- **TODO:** This is almost complete. One pending item is tensor-aware implementation of `nvshmem.putmem_signal_block `and `nvshmem.signal_wait_until` From my investigation, I found the root cause to be that this specific tensor API uses local addresses instead of remote addresses for the peer ``` Pointer-Based Version: Rank 0 → Rank 1: Local buffer: 0x430300a00 (src) Remote buffer: 0x2430300c00 (dst) ← Rank 1's memory Remote signal: 0x2430301600 (sig) ← Rank 1's signal Rank 1 (waiting): Local signal: 0x430301600 (waits here) Tensor-Based Version: Rank 0 → Rank 1: Local buffer: 0x430300a00 (src) Local buffer: 0x430300c00 (dst) ← this is wrong Local signal: 0x430300e00 (sig) ← this is wrong Rank 1 (waiting): Local signal: 0x430300e00 (waits here) ``` Next Steps: Need mechanism to resolve local tensor → remote PE address, equivalent to handle.buffer_ptrs[peer] lookup. Pull Request resolved: #159788 Approved by: https://github.com/mandroid6, https://github.com/ngimel ghstack dependencies: #158515, #158718, #159136, #159215, #159701, #159734, #159755, #159756

…prove code clarity (pytorch#159136) Quick refactor for consistency and clarity. 1. We now standardize all NVSHMEM data-moving collectives (put, get, alltoall, broadcast) to use their byte-based *_mem_block variants. This makes the API behavior more predictable and avoids mixing paradigms. 2. Previously, some functions operated on element counts (nelems), while others expected byte sizes but still used `nelems` as the param name. That inconsistency was easy to miss and could lead to bugs, especially for devs not familiar with the NVSHMEM internals. To clean this up: • All byte-based APIs now use nbytes or nbytes_per_pe to make the units explicit. • Typed APIs consistently use nelems for element counts. • Docstrings were added or updated to clarify expected units. Also did some code cleanup — removed unused functions, fixed typos in comments, and did some general housekeeping. This should make the API more intuitive and reduce friction for developers. Pull Request resolved: pytorch#159136 Approved by: https://github.com/mandroid6, https://github.com/ngimel ghstack dependencies: pytorch#158515, pytorch#158718

When playing around with it, I noticed some flakiness in this test across sessions. After debugging, turns out the heavy sync primitives that I was calling (like `nvshmem_quiet()` or `nvshmem_fence()`) from inside Triton kernels was causing deadlocks. The original test tried to guarantee ordering: `put(data) -> fence/quiet -> put(flag)`. But the GPU thread got stuck in `quiet()` waiting for network confirmation while holding the SM, creating a deadlock. The fix was realizing `wait_until` already provides all the sync you need. Just do: - PE A: `nvshmem_wait_until(&ivar, ...)` - PE B: `nvshmem_put(&ivar_on_PE_A, ...)` Pull Request resolved: pytorch#159215 Approved by: https://github.com/mandroid6, https://github.com/ngimel ghstack dependencies: pytorch#158515, pytorch#158718, pytorch#159136

…(make device library discoverable + fix peer calculation bug) (pytorch#159701) This PR introduces support for Triton 3.4 and resolves several CI and test-related issues. **Triton 3.4 Compatibility** - The JIT post-compile hook has been updated from the legacy JITFunction.compiled_hook to the new API path at triton.knobs.runtime.jit_post_compile_hook. - The internal parameter for kernel semantics in extern function definitions has been updated from _semantic to _builder to align with API changes. **Fix CI Errors** - The new logic inspects the RPATH of libtorch_nvshmem.so to find the NVSHMEM device library, preventing CI tests from being skipped. - Added a decorator to run NVSHMEM tests only on H100s (compatible hardware) **Peer Rank Calculation Fix** - The peer calculation in test_nvshmem_triton.py was changed from peer = (world_size - 1) - rank to peer = 1 - rank. Reasoning: The previous logic was only valid for a 2-rank setup. In the 8-rank CI environment, it incorrectly mapped peers (e.g., rank 0 to 7), breaking tests that assume a 0↔1 communication pattern. This was reproduced and validated on an 8-rank dev setup. Pull Request resolved: pytorch#159701 Approved by: https://github.com/ngimel ghstack dependencies: pytorch#158515, pytorch#158718, pytorch#159136, pytorch#159215

…m in their name (pytorch#159734) Previously, a global post-compile hook initialized the NVSHMEM module for all Triton kernels, which was inefficient. This change conditionally initializes `_nvshmemx_cumodule_init(kernel.module)` only for Triton kernels containing "nvshmem" in their name. Also updated the names for all of our nvshmem kernels to align with this. Pull Request resolved: pytorch#159734 Approved by: https://github.com/ngimel ghstack dependencies: pytorch#158515, pytorch#158718, pytorch#159136, pytorch#159215, pytorch#159701

…tomatic dtype‐based dispatch (pytorch#159755) This change introduces a single, generic Triton‐extern wrapper for NVSHMEM team‐based reductions. We now expose one function, `nvshmem.reduce(team, dest, source, nreduce, operation, dtype_id)`, that covers all supported ops (sum, max, min, prod) and dtypes (int8…int64, uint8…uint64, float16, bfloat16, float32, float64). It accepts real dtype objects (torch.dtype or tl.dtype) directly in the Triton kernel launch. Internally, we normalize dtype_id (handling tl.dtype, torch.dtype, str, or constexpr) into the canonical NVSHMEM typename and assemble the proper function name, e.g. nvshmem_float_sum_reduce or nvshmem_bfloat16_prod_reduce Pull Request resolved: pytorch#159755 Approved by: https://github.com/ngimel ghstack dependencies: pytorch#158515, pytorch#158718, pytorch#159136, pytorch#159215, pytorch#159701, pytorch#159734

Fed Claude Code NVSHMEM Documentation and asked it to generate helpful docstrings. Verified for correctness. Pull Request resolved: pytorch#159756 Approved by: https://github.com/mandroid6, https://github.com/ngimel ghstack dependencies: pytorch#158515, pytorch#158718, pytorch#159136, pytorch#159215, pytorch#159701, pytorch#159734, pytorch#159755

…on kernels (pytorch#159788) This PR introduces a small `@triton.jit` wrapper function over our core NVSHMEM extern functions for users to send tensors as inputs to their NVSHMEM Triton kernels (rather than pointers). The goal is to abstract away tedious details from the developer, like manual byte-size calculations and handling of raw `int64` pointers. This lets developers work directly with typed Triton tensors and element counts, which will also be useful if you want to do for instance some local math on the data. ----- **TODO:** This is almost complete. One pending item is tensor-aware implementation of `nvshmem.putmem_signal_block `and `nvshmem.signal_wait_until` From my investigation, I found the root cause to be that this specific tensor API uses local addresses instead of remote addresses for the peer ``` Pointer-Based Version: Rank 0 → Rank 1: Local buffer: 0x430300a00 (src) Remote buffer: 0x2430300c00 (dst) ← Rank 1's memory Remote signal: 0x2430301600 (sig) ← Rank 1's signal Rank 1 (waiting): Local signal: 0x430301600 (waits here) Tensor-Based Version: Rank 0 → Rank 1: Local buffer: 0x430300a00 (src) Local buffer: 0x430300c00 (dst) ← this is wrong Local signal: 0x430300e00 (sig) ← this is wrong Rank 1 (waiting): Local signal: 0x430300e00 (waits here) ``` Next Steps: Need mechanism to resolve local tensor → remote PE address, equivalent to handle.buffer_ptrs[peer] lookup. Pull Request resolved: pytorch#159788 Approved by: https://github.com/mandroid6, https://github.com/ngimel ghstack dependencies: pytorch#158515, pytorch#158718, pytorch#159136, pytorch#159215, pytorch#159701, pytorch#159734, pytorch#159755, pytorch#159756

…prove code clarity (pytorch#159136) Quick refactor for consistency and clarity. 1. We now standardize all NVSHMEM data-moving collectives (put, get, alltoall, broadcast) to use their byte-based *_mem_block variants. This makes the API behavior more predictable and avoids mixing paradigms. 2. Previously, some functions operated on element counts (nelems), while others expected byte sizes but still used `nelems` as the param name. That inconsistency was easy to miss and could lead to bugs, especially for devs not familiar with the NVSHMEM internals. To clean this up: • All byte-based APIs now use nbytes or nbytes_per_pe to make the units explicit. • Typed APIs consistently use nelems for element counts. • Docstrings were added or updated to clarify expected units. Also did some code cleanup — removed unused functions, fixed typos in comments, and did some general housekeeping. This should make the API more intuitive and reduce friction for developers. Pull Request resolved: pytorch#159136 Approved by: https://github.com/mandroid6, https://github.com/ngimel ghstack dependencies: pytorch#158515, pytorch#158718

When playing around with it, I noticed some flakiness in this test across sessions. After debugging, turns out the heavy sync primitives that I was calling (like `nvshmem_quiet()` or `nvshmem_fence()`) from inside Triton kernels was causing deadlocks. The original test tried to guarantee ordering: `put(data) -> fence/quiet -> put(flag)`. But the GPU thread got stuck in `quiet()` waiting for network confirmation while holding the SM, creating a deadlock. The fix was realizing `wait_until` already provides all the sync you need. Just do: - PE A: `nvshmem_wait_until(&ivar, ...)` - PE B: `nvshmem_put(&ivar_on_PE_A, ...)` Pull Request resolved: pytorch#159215 Approved by: https://github.com/mandroid6, https://github.com/ngimel ghstack dependencies: pytorch#158515, pytorch#158718, pytorch#159136

…(make device library discoverable + fix peer calculation bug) (pytorch#159701) This PR introduces support for Triton 3.4 and resolves several CI and test-related issues. **Triton 3.4 Compatibility** - The JIT post-compile hook has been updated from the legacy JITFunction.compiled_hook to the new API path at triton.knobs.runtime.jit_post_compile_hook. - The internal parameter for kernel semantics in extern function definitions has been updated from _semantic to _builder to align with API changes. **Fix CI Errors** - The new logic inspects the RPATH of libtorch_nvshmem.so to find the NVSHMEM device library, preventing CI tests from being skipped. - Added a decorator to run NVSHMEM tests only on H100s (compatible hardware) **Peer Rank Calculation Fix** - The peer calculation in test_nvshmem_triton.py was changed from peer = (world_size - 1) - rank to peer = 1 - rank. Reasoning: The previous logic was only valid for a 2-rank setup. In the 8-rank CI environment, it incorrectly mapped peers (e.g., rank 0 to 7), breaking tests that assume a 0↔1 communication pattern. This was reproduced and validated on an 8-rank dev setup. Pull Request resolved: pytorch#159701 Approved by: https://github.com/ngimel ghstack dependencies: pytorch#158515, pytorch#158718, pytorch#159136, pytorch#159215

…m in their name (pytorch#159734) Previously, a global post-compile hook initialized the NVSHMEM module for all Triton kernels, which was inefficient. This change conditionally initializes `_nvshmemx_cumodule_init(kernel.module)` only for Triton kernels containing "nvshmem" in their name. Also updated the names for all of our nvshmem kernels to align with this. Pull Request resolved: pytorch#159734 Approved by: https://github.com/ngimel ghstack dependencies: pytorch#158515, pytorch#158718, pytorch#159136, pytorch#159215, pytorch#159701

…tomatic dtype‐based dispatch (pytorch#159755) This change introduces a single, generic Triton‐extern wrapper for NVSHMEM team‐based reductions. We now expose one function, `nvshmem.reduce(team, dest, source, nreduce, operation, dtype_id)`, that covers all supported ops (sum, max, min, prod) and dtypes (int8…int64, uint8…uint64, float16, bfloat16, float32, float64). It accepts real dtype objects (torch.dtype or tl.dtype) directly in the Triton kernel launch. Internally, we normalize dtype_id (handling tl.dtype, torch.dtype, str, or constexpr) into the canonical NVSHMEM typename and assemble the proper function name, e.g. nvshmem_float_sum_reduce or nvshmem_bfloat16_prod_reduce Pull Request resolved: pytorch#159755 Approved by: https://github.com/ngimel ghstack dependencies: pytorch#158515, pytorch#158718, pytorch#159136, pytorch#159215, pytorch#159701, pytorch#159734

Fed Claude Code NVSHMEM Documentation and asked it to generate helpful docstrings. Verified for correctness. Pull Request resolved: pytorch#159756 Approved by: https://github.com/mandroid6, https://github.com/ngimel ghstack dependencies: pytorch#158515, pytorch#158718, pytorch#159136, pytorch#159215, pytorch#159701, pytorch#159734, pytorch#159755

…on kernels (pytorch#159788) This PR introduces a small `@triton.jit` wrapper function over our core NVSHMEM extern functions for users to send tensors as inputs to their NVSHMEM Triton kernels (rather than pointers). The goal is to abstract away tedious details from the developer, like manual byte-size calculations and handling of raw `int64` pointers. This lets developers work directly with typed Triton tensors and element counts, which will also be useful if you want to do for instance some local math on the data. ----- **TODO:** This is almost complete. One pending item is tensor-aware implementation of `nvshmem.putmem_signal_block `and `nvshmem.signal_wait_until` From my investigation, I found the root cause to be that this specific tensor API uses local addresses instead of remote addresses for the peer ``` Pointer-Based Version: Rank 0 → Rank 1: Local buffer: 0x430300a00 (src) Remote buffer: 0x2430300c00 (dst) ← Rank 1's memory Remote signal: 0x2430301600 (sig) ← Rank 1's signal Rank 1 (waiting): Local signal: 0x430301600 (waits here) Tensor-Based Version: Rank 0 → Rank 1: Local buffer: 0x430300a00 (src) Local buffer: 0x430300c00 (dst) ← this is wrong Local signal: 0x430300e00 (sig) ← this is wrong Rank 1 (waiting): Local signal: 0x430300e00 (waits here) ``` Next Steps: Need mechanism to resolve local tensor → remote PE address, equivalent to handle.buffer_ptrs[peer] lookup. Pull Request resolved: pytorch#159788 Approved by: https://github.com/mandroid6, https://github.com/ngimel ghstack dependencies: pytorch#158515, pytorch#158718, pytorch#159136, pytorch#159215, pytorch#159701, pytorch#159734, pytorch#159755, pytorch#159756

[SymmMem] Standardize NVSHMEM Triton wrappers on byte-based APIs + im…

2746558

…prove code clarity [ghstack-poisoned]

codingwithsurya mentioned this pull request Jul 22, 2025

[SymmMem] Add NVSHMEM Reduction support (sum, min, max) into Triton #158515

Closed

This was referenced Jul 25, 2025

[ignore] [SymmMem] Add signal_op test with ring broadcast example #158676

Closed

[SymmMem] Use _get_default_group() instead of group.WORLD for group_name access #158718

Closed

pytorch-bot bot added ciflow/h100-symm-mem oncall: distributed Add this issue/PR to distributed oncall triage queue labels Jul 25, 2025

codingwithsurya added a commit that referenced this pull request Jul 25, 2025

[SymmMem] Standardize NVSHMEM Triton wrappers on byte-based APIs + im…

c987ffd

…prove code clarity ghstack-source-id: 2adbce6 Pull Request resolved: #159136

Update on "[SymmMem] Standardize NVSHMEM Triton wrappers on byte-base…

678bf06

…d APIs + improve code clarity" cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta [ghstack-poisoned]

codingwithsurya added a commit that referenced this pull request Jul 25, 2025

[SymmMem] Standardize NVSHMEM Triton wrappers on byte-based APIs + im…

9ebe01a

…prove code clarity ghstack-source-id: f1cf7e8 Pull Request resolved: #159136

codingwithsurya added the release notes: distributed (symm_mem) release note label for symmetric memory label Jul 25, 2025

codingwithsurya self-assigned this Jul 25, 2025

codingwithsurya changed the title ~~[SymmMem] Standardize NVSHMEM Triton wrappers on byte-based APIs + improve code clarity~~ [wip] [SymmMem] Standardize NVSHMEM Triton wrappers on byte-based APIs + improve code clarity Jul 25, 2025

codingwithsurya added a commit that referenced this pull request Jul 25, 2025

[SymmMem] Standardize NVSHMEM Triton wrappers on byte-based APIs + im…

9714edc

…prove code clarity ghstack-source-id: cb6c6dc Pull Request resolved: #159136

codingwithsurya added a commit that referenced this pull request Jul 25, 2025

[SymmMem] Standardize NVSHMEM Triton wrappers on byte-based APIs + im…

ad1baac

…prove code clarity ghstack-source-id: a17309f Pull Request resolved: #159136

codingwithsurya mentioned this pull request Jul 26, 2025

[SymmMem] Fix flaky wait_until test #159215

Closed

codingwithsurya changed the title ~~[wip] [SymmMem] Standardize NVSHMEM Triton wrappers on byte-based APIs + improve code clarity~~ [SymmMem] Standardize NVSHMEM Triton wrappers on byte-based APIs + improve code clarity Jul 31, 2025

codingwithsurya requested review from kwen2501, mandroid6 and ngimel July 31, 2025 23:08

codingwithsurya mentioned this pull request Aug 2, 2025

[SymmMem] Add Triton 3.4 support to NVSHMEM Triton and fix CI tests (make device library discoverable + fix peer calculation bug) #159701

Closed

codingwithsurya added 2 commits August 2, 2025 20:07

codingwithsurya mentioned this pull request Aug 3, 2025

[SymmMem] Initialize NVSHMEM module only for kernels that have nvshmem in their name #159734

Closed

This was referenced Aug 4, 2025

[SymmMem] Refactor NVSHMEM Reduction API to be more ergonomic with automatic dtype‐based dispatch #159755

Closed

[SymmMem] Add helpful docstrings for all NVSHMEM APIs #159756

Closed

codingwithsurya added 2 commits August 6, 2025 16:45

izaitsevfb removed the ciflow/h100-symm-mem label Aug 7, 2025

pytorchmergebot closed this in ea7fe0e Aug 8, 2025

pytorchmergebot added the Merged label Aug 8, 2025

github-actions bot deleted the gh/codingwithsurya/15/head branch September 8, 2025 02:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SymmMem] Standardize NVSHMEM Triton wrappers on byte-based APIs + improve code clarity#159136

[SymmMem] Standardize NVSHMEM Triton wrappers on byte-based APIs + improve code clarity#159136
codingwithsurya wants to merge 16 commits intogh/codingwithsurya/15/basefrom
gh/codingwithsurya/15/head

codingwithsurya commented Jul 25, 2025 •

edited

Loading

Uh oh!

pytorch-bot bot commented Jul 25, 2025 •

edited

Loading

Uh oh!

pytorchmergebot commented Aug 7, 2025

Uh oh!

pytorchmergebot commented Aug 7, 2025

Uh oh!

pytorchmergebot commented Aug 8, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

codingwithsurya commented Jul 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot bot commented Jul 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159136

❌ 1 Cancelled Job, 1 Unrelated Failure

Uh oh!

pytorchmergebot commented Aug 7, 2025

Uh oh!

pytorchmergebot commented Aug 7, 2025

Uh oh!

pytorchmergebot commented Aug 8, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

codingwithsurya commented Jul 25, 2025 •

edited

Loading

pytorch-bot bot commented Jul 25, 2025 •

edited

Loading