[inductor] Layout allocator approach for symm_mem graph inputs by tianrengao · Pull Request #175797 · pytorch/pytorch

tianrengao · 2026-02-25T21:48:24Z

Summary:
Replace the Triton identity-copy workaround for graph inputs (InputBuffer)
that need P2P memory with a Layout-based allocator constraint approach.

Three paths in _maybe_realize_symm_mem, in priority order:

1. We control allocation (ComputedBuffer) → CommBufferLayout (zero-copy)
2. Graph placeholder (InputBuffer, static shapes) → mark layout.allocator =
   AllocatorType(kind="symm_mem"); wrapper generates persistent P2P buffer +
   DMA .copy_() inside the CG partition. Runs on copy engine, not compute SMs.
3. Fallback: insert Triton identity copy with CommBufferLayout (original).
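
The priority order above can be sketched as a small decision function. This is only an illustration: `FakeBuffer`, `Path`, and `choose_symm_mem_path` are hypothetical stand-ins, not the real Inductor IR classes or the actual body of `_maybe_realize_symm_mem`.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Path(Enum):
    COMM_BUFFER_LAYOUT = auto()    # path 1: zero-copy
    ALLOCATOR_CONSTRAINT = auto()  # path 2: persistent P2P buffer + DMA copy
    TRITON_IDENTITY_COPY = auto()  # path 3: original fallback


@dataclass
class FakeBuffer:
    """Hypothetical stand-in for an IR node; not the real Inductor class."""
    kind: str                       # "computed" or "input"
    has_symbolic_shape: bool = False


def choose_symm_mem_path(buf: FakeBuffer) -> Path:
    # Path 1: the compiler owns the allocation, so it can place the
    # buffer in symmetric memory directly.
    if buf.kind == "computed":
        return Path.COMM_BUFFER_LAYOUT
    # Path 2: graph placeholder with static shapes; mark the layout so the
    # wrapper emits a persistent P2P buffer and a copy into it.
    if buf.kind == "input" and not buf.has_symbolic_shape:
        return Path.ALLOCATOR_CONSTRAINT
    # Path 3: everything else keeps the Triton identity copy.
    return Path.TRITON_IDENTITY_COPY
```

A graph input with symbolic shapes falls through to the fallback, which mirrors the is_symbolic guard mentioned under Key changes.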

Under CUDAGraph partition mode, only the partition wrapper (CG-recorded)
emits the P2P allocation and .copy_(). The outer wrapper skips entirely so
that:

- .copy_() is captured inside the CG recording → ~0µs dispatch on replay
- No redundant copy in outer call() → eliminates the self-copy no-op bug
- Single empty_strided_p2p in header → no duplicate P2P allocation
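
A toy model of the partition-only guard, assuming a wrapper that collects generated lines in `header` and `prefix` lists. `ToyWrapper` and its two flags are hypothetical; the real wrapper code in wrapper.py is not reproduced here, and the arguments to `empty_strided_p2p` are deliberately elided in the emitted strings.

```python
class ToyWrapper:
    """Hypothetical miniature of a codegen wrapper; not the Inductor class."""

    def __init__(self, *, cudagraph_partition: bool, is_partition: bool):
        self.cudagraph_partition = cudagraph_partition
        self.is_partition = is_partition
        self.header = []  # emitted once
        self.prefix = []  # emitted at the start of every call

    def codegen_p2p_input_copies(self, input_names):
        # Guard: under CUDAGraph partition mode, only the partition
        # wrapper emits; the outer wrapper skips entirely.
        if self.cudagraph_partition and not self.is_partition:
            return
        for name in input_names:
            # Allocation arguments are elided on purpose.
            self.header.append(f"{name}_p2p = empty_strided_p2p(...)")
            self.prefix.append(f"{name}_p2p.copy_({name})")


outer = ToyWrapper(cudagraph_partition=True, is_partition=False)
inner = ToyWrapper(cudagraph_partition=True, is_partition=True)
for w in (outer, inner):
    w.codegen_p2p_input_copies(["arg0_1"])
```

In this sketch the outer wrapper ends up with empty `header` and `prefix`, while the partition wrapper carries exactly one allocation and one copy.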

CUDAGraph fix (CUDASymmetricMemory.cu): CUDAPeerAllocInfo's constructor
previously used CUDACachingAllocator::raw_alloc for the device-side pointer
arrays (buffers_dev_, signal_pads_dev_). During CG tree warmup the caching
allocator is redirected to a private pool, so these small, long-lived
infrastructure allocations landed in the CG pool and were flagged as
untracked leaks by check_memory_pool. Replaced with cudaMalloc, which always
allocates from the default CUDA pool regardless of the active pool context.
Also added ~CUDAPeerAllocInfo destructor (the old code never freed these
allocations — a pre-existing leak).
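
The pool-redirection problem can be illustrated with a pure-Python toy model. Everything here (`ToyCachingAllocator`, `use_pool`, `simulate_warmup`) is hypothetical and only mimics the behavior described above; it does not call any CUDA or PyTorch API.

```python
from contextlib import contextmanager


class ToyCachingAllocator:
    """Toy allocator whose raw_alloc lands in whichever pool is active."""

    def __init__(self):
        self.active_pool = "default"
        self.pools = {"default": []}

    @contextmanager
    def use_pool(self, name):
        # Models the redirection to a private pool during CG warmup.
        prev = self.active_pool
        self.pools.setdefault(name, [])
        self.active_pool = name
        try:
            yield
        finally:
            self.active_pool = prev

    def raw_alloc(self, tag):
        self.pools[self.active_pool].append(tag)


def simulate_warmup():
    alloc = ToyCachingAllocator()
    direct = []  # stands in for allocations made outside the caching allocator
    with alloc.use_pool("cg_private"):
        alloc.raw_alloc("buffers_dev_")    # old behavior: lands in the CG pool
        direct.append("signal_pads_dev_")  # new behavior: bypasses the pools
    return alloc.pools, direct
```

In the model, the allocation made through `raw_alloc` during warmup is recorded in the private pool, which is the situation a pool check would flag; the direct allocation never touches it.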

Benchmark (devgpu088, 8×H100, one_shot_all_reduce, 200w+200i,
each shape×mode in separate torchrun to avoid alloc_id collision):

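| Shape     | PR4 no-CG | PR5 no-CG | Δ no-CG   | PR4 CG  | PR5 CG  | Δ CG      |
|-----------|-----------|-----------|-----------|---------|---------|-----------|
| 8×8       | 92.9µs    | 86.8µs    | **−6.6%** | 89.6µs  | 81.3µs  | **−9.3%** |
| 64×64     | 95.5µs    | 86.8µs    | **−9.1%** | 86.8µs  | 89.9µs  | +3.6%     |
| 256×256   | 86.3µs    | 87.5µs    | +1.4%     | 91.3µs  | 87.2µs  | **−4.5%** |
| 512×512   | —†        | 87.4µs    | —         | 89.0µs  | 81.9µs  | **−8.0%** |
| 1024×1024 | 144.5µs   | 138.1µs   | **−4.4%** | 154.5µs | 154.3µs | −0.1%     |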

† 512×512 PR4 no-CG outlier excluded.
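
The Δ columns are consistent with the relative change (PR5 − PR4) / PR4, so a negative value means PR5 is faster. A short check using the latencies from the table (the helper name `pct_delta` is introduced here for illustration):

```python
def pct_delta(baseline_us: float, candidate_us: float) -> float:
    """Relative change of candidate vs. baseline, in percent, one decimal."""
    return round((candidate_us - baseline_us) / baseline_us * 100, 1)


# (PR4, PR5) latencies in microseconds, copied from the table above.
no_cg = {"8x8": (92.9, 86.8), "64x64": (95.5, 86.8),
         "256x256": (86.3, 87.5), "1024x1024": (144.5, 138.1)}
cg = {"8x8": (89.6, 81.3), "64x64": (86.8, 89.9), "256x256": (91.3, 87.2),
      "512x512": (89.0, 81.9), "1024x1024": (154.5, 154.3)}

no_cg_delta = {shape: pct_delta(*pair) for shape, pair in no_cg.items()}
cg_delta = {shape: pct_delta(*pair) for shape, pair in cg.items()}
```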

Key results:

- no-CG: PR5 is 4-9% faster at three of the four shapes with a valid PR4
  baseline (DMA dispatch cheaper than Triton kernel launch); 256×256 is 1.4%
  slower
- CG: PR5 is 4-9% faster at three of five shapes (.copy_() CG-captured, ~0µs
  dispatch; DMA runs on copy engine instead of compute SMs); 1024×1024 is
  flat and 64×64 is 3.6% slower
- CG regression from earlier PR5 versions is fully resolved by the
  partition-only guard in codegen_p2p_input_copies()

Key changes:

- ir.py: AllocatorType frozen dataclass (kind="default"/"symm_mem",
  group_name) + Layout.allocator field, propagated through
  as_fixed()/__eq__
- comm_lowering.py: 3-path _maybe_realize_symm_mem with is_symbolic guard
- wrapper.py: codegen_p2p_input_copies(): header-level P2P alloc (once)
  + prefix-level DMA .copy_() (every call); CG partition guard ensures only
  the partition wrapper emits (not the outer wrapper), so .copy_() is
  CG-captured and replays with ~0µs dispatch
- wrapper.py: fix f-string syntax error in empty_strided_p2p codegen
- wrapper.py: move proton set_allocator / start() from self.header to
  self.prefix; under CUDAGraph partition mode self.header is redirected
  into call(), but proton must execute on every call() invocation
- CUDASymmetricMemory.cu/hpp: raw_alloc → cudaMalloc + destructor
- test: Path 2 + Path 3 fallback variant + AllocatorType propagation
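
A minimal sketch of what a frozen allocator tag and its propagation through a layout could look like. The field names `kind` and `group_name` come from the description above; the default values, the `ToyLayout` class, and the group name `"0"` are assumptions made for this example and are not the real `ir.py` definitions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AllocatorType:
    kind: str = "default"             # "default" or "symm_mem"
    group_name: Optional[str] = None  # assumed optional for this sketch


@dataclass
class ToyLayout:
    """Hypothetical layout; the generated __eq__ compares allocator too."""
    size: Tuple[int, ...]
    stride: Tuple[int, ...]
    allocator: AllocatorType = AllocatorType()

    def as_fixed(self) -> "ToyLayout":
        # Propagate the allocator constraint instead of dropping it.
        return ToyLayout(self.size, self.stride, self.allocator)


plain = ToyLayout((8, 8), (8, 1))
p2p = ToyLayout((8, 8), (8, 1), AllocatorType("symm_mem", "0"))
```

Because the allocator takes part in equality, two layouts with the same size and stride but different allocators compare unequal, and `as_fixed()` keeps the constraint attached.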

Test Plan:

1. Unit tests:
   torchrun --nproc_per_node=2 -m pytest test/distributed/test_symmetric_memory.py \
   -k "LoweringTest" -xvs
   → 9 passed, 1 skipped
2. Codegen verification (torchrun --nproc_per_node=8):
   - no-CG: single empty_strided_p2p in header, single .copy_() in prefix,
     no Triton identity kernel. Correctness check passed (compiled vs eager).
   - CG: single empty_strided_p2p in header (no duplicate), single .copy_()
     inside partition_0 (CG-captured), NO .copy_() in outer call().
     Correctness check passed.
3. Benchmark: devgpu088, 8×H100, torchrun --nproc_per_node=8,
   200 warmup + 200 timed iters per shape. Each (shape × mode) pair in
   separate torchrun invocation. Results in table above.
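
The codegen checks in step 2 amount to counting occurrences in the generated wrapper source. A sketch of such a check follows; `count_p2p_codegen` and the sample source string are hypothetical and are not taken from the actual test or from real Inductor output.

```python
def count_p2p_codegen(src: str) -> dict:
    """Count P2P allocations and copy calls in generated source text."""
    return {
        "p2p_allocs": src.count("empty_strided_p2p("),
        "copies": src.count(".copy_("),
    }


# Hypothetical generated source for illustration only; arguments elided.
sample_src = """
arg0_1_p2p = empty_strided_p2p(...)

def partition_0(args):
    arg0_1_p2p.copy_(arg0_1)

def call(args):
    return partition_0(args)
"""

counts = count_p2p_codegen(sample_src)
```

For the sample above the counts are one allocation and one copy, matching the "single empty_strided_p2p, single .copy_()" expectation.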

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

pytorch-bot · 2026-02-25T21:48:28Z

✅ You can merge normally! (1 Unrelated Failure)

As of commit 692eea6 with merge base c5dcefd:

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

inductor / unit-test / inductor-test / test (inductor, 1, 2, linux.g5.4xlarge.nvidia.gpu) (gh) (trunk failure)
test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Summary: Replace the Triton identity-copy workaround for graph inputs (InputBuffer) that need P2P memory with a Layout-based allocator constraint approach. Three paths in _maybe_realize_symm_mem, in priority order: 1. We control allocation (ComputedBuffer) → CommBufferLayout (zero-copy) 2. Graph placeholder (InputBuffer, static shapes) → mark layout.allocator = AllocatorType(kind="symm_mem"); wrapper generates persistent P2P buffer + DMA .copy_() inside the CG partition. Runs on copy engine, not compute SMs. 3. Fallback: insert Triton identity copy with CommBufferLayout (original). Under CUDAGraph partition mode, only the partition wrapper (CG-recorded) emits the P2P allocation and .copy_(). The outer wrapper skips entirely so that: - .copy_() is captured inside the CG recording → ~0µs dispatch on replay - No redundant copy in outer call() → eliminates the self-copy no-op bug - Single empty_strided_p2p in header → no duplicate P2P allocation CUDAGraph fix (CUDASymmetricMemory.cu): CUDAPeerAllocInfo's constructor previously used CUDACachingAllocator::raw_alloc for the device-side pointer arrays (buffers_dev_, signal_pads_dev_). During CG tree warmup the caching allocator is redirected to a private pool, so these small, long-lived infrastructure allocations landed in the CG pool and were flagged as untracked leaks by check_memory_pool. Replaced with cudaMalloc, which always allocates from the default CUDA pool regardless of the active pool context. Also added ~CUDAPeerAllocInfo destructor (the old code never freed these allocations — a pre-existing leak). 
Benchmark (devgpu088, 8×H100, one_shot_all_reduce, 200w+200i, each shape×mode in separate torchrun to avoid alloc_id collision): | Shape | PR4 no-CG | PR5 no-CG | Δ no-CG | PR4 CG | PR5 CG | Δ CG | |-----------|-----------|-----------|-----------|--------|--------|-----------| | 8×8 | 92.9µs | 86.8µs | **−6.6%** | 89.6µs | 81.3µs | **−9.3%** | | 64×64 | 95.5µs | 86.8µs | **−9.1%** | 86.8µs | 89.9µs | +3.6% | | 256×256 | 86.3µs | 87.5µs | +1.4% | 91.3µs | 87.2µs | **−4.5%** | | 512×512 | —† | 87.4µs | — | 89.0µs | 81.9µs | **−8.0%** | | 1024×1024 | 144.5µs | 138.1µs | **−4.4%** | 154.5µs| 154.3µs| −0.1% | † 512×512 PR4 no-CG outlier excluded. Key results: - no-CG: PR5 wins 4-9% (DMA dispatch cheaper than Triton kernel launch) - CG: PR5 wins 4-9% at most shapes (.copy_() CG-captured, ~0µs dispatch; DMA runs on copy engine instead of compute SMs) - CG regression from earlier PR5 versions is fully resolved by the partition-only guard in codegen_p2p_input_copies() Key changes: - ir.py: AllocatorType frozen dataclass (kind="default"/"symm_mem", group_name) + Layout.allocator field, propagated through as_fixed()/__eq__ - comm_lowering.py: 3-path _maybe_realize_symm_mem with is_symbolic guard - wrapper.py: codegen_p2p_input_copies() — header-level P2P alloc (once) + prefix-level DMA .copy_() (every call); CG partition guard ensures only the partition wrapper emits (not the outer wrapper), so .copy_() is CG-captured and replays with ~0µs dispatch - wrapper.py: fix f-string syntax error in empty_strided_p2p codegen - wrapper.py: move proton set_allocator / start() from self.header to self.prefix — under CUDAGraph partition mode self.header is redirected into call(), but proton must execute on every call() invocation - CUDASymmetricMemory.cu/hpp: raw_alloc → cudaMalloc + destructor - test: Path 2 + Path 3 fallback variant + AllocatorType propagation Test Plan: 1. 
Unit tests: torchrun --nproc_per_node=2 -m pytest test/distributed/test_symmetric_memory.py \ -k "LoweringTest" -xvs → 9 passed, 1 skipped 2. Codegen verification (torchrun --nproc_per_node=8): - no-CG: single empty_strided_p2p in header, single .copy_() in prefix, no Triton identity kernel. Correctness check passed (compiled vs eager). - CG: single empty_strided_p2p in header (no duplicate), single .copy_() inside partition_0 (CG-captured), NO .copy_() in outer call(). Correctness check passed. 3. Benchmark: devgpu088, 8×H100, torchrun --nproc_per_node=8, 200 warmup + 200 timed iters per shape. Each (shape × mode) pair in separate torchrun invocation. Results in table above. ghstack-source-id: ee6eb75 Pull Request resolved: #175797

…puts" Summary: Replace the Triton identity-copy workaround for graph inputs (InputBuffer) that need P2P memory with a Layout-based allocator constraint approach. Three paths in _maybe_realize_symm_mem, in priority order: 1. We control allocation (ComputedBuffer) → CommBufferLayout (zero-copy) 2. Graph placeholder (InputBuffer, static shapes) → mark layout.allocator = AllocatorType(kind="symm_mem"); wrapper generates persistent P2P buffer + DMA .copy_() inside the CG partition. Runs on copy engine, not compute SMs. 3. Fallback: insert Triton identity copy with CommBufferLayout (original). Under CUDAGraph partition mode, only the partition wrapper (CG-recorded) emits the P2P allocation and .copy_(). The outer wrapper skips entirely so that: - .copy_() is captured inside the CG recording → ~0µs dispatch on replay - No redundant copy in outer call() → eliminates the self-copy no-op bug - Single empty_strided_p2p in header → no duplicate P2P allocation CUDAGraph fix (CUDASymmetricMemory.cu): CUDAPeerAllocInfo's constructor previously used CUDACachingAllocator::raw_alloc for the device-side pointer arrays (buffers_dev_, signal_pads_dev_). During CG tree warmup the caching allocator is redirected to a private pool, so these small, long-lived infrastructure allocations landed in the CG pool and were flagged as untracked leaks by check_memory_pool. Replaced with cudaMalloc, which always allocates from the default CUDA pool regardless of the active pool context. Also added ~CUDAPeerAllocInfo destructor (the old code never freed these allocations — a pre-existing leak). 
Benchmark (devgpu088, 8×H100, one_shot_all_reduce, 200w+200i, each shape×mode in separate torchrun to avoid alloc_id collision): | Shape | PR4 no-CG | PR5 no-CG | Δ no-CG | PR4 CG | PR5 CG | Δ CG | |-----------|-----------|-----------|-----------|--------|--------|-----------| | 8×8 | 92.9µs | 86.8µs | **−6.6%** | 89.6µs | 81.3µs | **−9.3%** | | 64×64 | 95.5µs | 86.8µs | **−9.1%** | 86.8µs | 89.9µs | +3.6% | | 256×256 | 86.3µs | 87.5µs | +1.4% | 91.3µs | 87.2µs | **−4.5%** | | 512×512 | —† | 87.4µs | — | 89.0µs | 81.9µs | **−8.0%** | | 1024×1024 | 144.5µs | 138.1µs | **−4.4%** | 154.5µs| 154.3µs| −0.1% | † 512×512 PR4 no-CG outlier excluded. Key results: - no-CG: PR5 wins 4-9% (DMA dispatch cheaper than Triton kernel launch) - CG: PR5 wins 4-9% at most shapes (.copy_() CG-captured, ~0µs dispatch; DMA runs on copy engine instead of compute SMs) - CG regression from earlier PR5 versions is fully resolved by the partition-only guard in codegen_p2p_input_copies() Key changes: - ir.py: AllocatorType frozen dataclass (kind="default"/"symm_mem", group_name) + Layout.allocator field, propagated through as_fixed()/__eq__ - comm_lowering.py: 3-path _maybe_realize_symm_mem with is_symbolic guard - wrapper.py: codegen_p2p_input_copies() — header-level P2P alloc (once) + prefix-level DMA .copy_() (every call); CG partition guard ensures only the partition wrapper emits (not the outer wrapper), so .copy_() is CG-captured and replays with ~0µs dispatch - wrapper.py: fix f-string syntax error in empty_strided_p2p codegen - wrapper.py: move proton set_allocator / start() from self.header to self.prefix — under CUDAGraph partition mode self.header is redirected into call(), but proton must execute on every call() invocation - CUDASymmetricMemory.cu/hpp: raw_alloc → cudaMalloc + destructor - test: Path 2 + Path 3 fallback variant + AllocatorType propagation Test Plan: 1. 
Unit tests: torchrun --nproc_per_node=2 -m pytest test/distributed/test_symmetric_memory.py \ -k "LoweringTest" -xvs → 9 passed, 1 skipped 2. Codegen verification (torchrun --nproc_per_node=8): - no-CG: single empty_strided_p2p in header, single .copy_() in prefix, no Triton identity kernel. Correctness check passed (compiled vs eager). - CG: single empty_strided_p2p in header (no duplicate), single .copy_() inside partition_0 (CG-captured), NO .copy_() in outer call(). Correctness check passed. 3. Benchmark: devgpu088, 8×H100, torchrun --nproc_per_node=8, 200 warmup + 200 timed iters per shape. Each (shape × mode) pair in separate torchrun invocation. Results in table above. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo [ghstack-poisoned]

Summary: Replace the Triton identity-copy workaround for graph inputs (InputBuffer) that need P2P memory with a Layout-based allocator constraint approach. Three paths in _maybe_realize_symm_mem, in priority order: 1. We control allocation (ComputedBuffer) → CommBufferLayout (zero-copy) 2. Graph placeholder (InputBuffer, static shapes) → mark layout.allocator = AllocatorType(kind="symm_mem"); wrapper generates persistent P2P buffer + DMA .copy_() inside the CG partition. Runs on copy engine, not compute SMs. 3. Fallback: insert Triton identity copy with CommBufferLayout (original). Under CUDAGraph partition mode, only the partition wrapper (CG-recorded) emits the P2P allocation and .copy_(). The outer wrapper skips entirely so that: - .copy_() is captured inside the CG recording → ~0µs dispatch on replay - No redundant copy in outer call() → eliminates the self-copy no-op bug - Single empty_strided_p2p in header → no duplicate P2P allocation CUDAGraph fix (CUDASymmetricMemory.cu): CUDAPeerAllocInfo's constructor previously used CUDACachingAllocator::raw_alloc for the device-side pointer arrays (buffers_dev_, signal_pads_dev_). During CG tree warmup the caching allocator is redirected to a private pool, so these small, long-lived infrastructure allocations landed in the CG pool and were flagged as untracked leaks by check_memory_pool. Replaced with cudaMalloc, which always allocates from the default CUDA pool regardless of the active pool context. Also added ~CUDAPeerAllocInfo destructor (the old code never freed these allocations — a pre-existing leak). 
Benchmark (devgpu088, 8×H100, one_shot_all_reduce, 200w+200i, each shape×mode in separate torchrun to avoid alloc_id collision): | Shape | PR4 no-CG | PR5 no-CG | Δ no-CG | PR4 CG | PR5 CG | Δ CG | |-----------|-----------|-----------|-----------|--------|--------|-----------| | 8×8 | 92.9µs | 86.8µs | **−6.6%** | 89.6µs | 81.3µs | **−9.3%** | | 64×64 | 95.5µs | 86.8µs | **−9.1%** | 86.8µs | 89.9µs | +3.6% | | 256×256 | 86.3µs | 87.5µs | +1.4% | 91.3µs | 87.2µs | **−4.5%** | | 512×512 | —† | 87.4µs | — | 89.0µs | 81.9µs | **−8.0%** | | 1024×1024 | 144.5µs | 138.1µs | **−4.4%** | 154.5µs| 154.3µs| −0.1% | † 512×512 PR4 no-CG outlier excluded. Key results: - no-CG: PR5 wins 4-9% (DMA dispatch cheaper than Triton kernel launch) - CG: PR5 wins 4-9% at most shapes (.copy_() CG-captured, ~0µs dispatch; DMA runs on copy engine instead of compute SMs) - CG regression from earlier PR5 versions is fully resolved by the partition-only guard in codegen_p2p_input_copies() Key changes: - ir.py: AllocatorType frozen dataclass (kind="default"/"symm_mem", group_name) + Layout.allocator field, propagated through as_fixed()/__eq__ - comm_lowering.py: 3-path _maybe_realize_symm_mem with is_symbolic guard - wrapper.py: codegen_p2p_input_copies() — header-level P2P alloc (once) + prefix-level DMA .copy_() (every call); CG partition guard ensures only the partition wrapper emits (not the outer wrapper), so .copy_() is CG-captured and replays with ~0µs dispatch - wrapper.py: fix f-string syntax error in empty_strided_p2p codegen - wrapper.py: move proton set_allocator / start() from self.header to self.prefix — under CUDAGraph partition mode self.header is redirected into call(), but proton must execute on every call() invocation - CUDASymmetricMemory.cu/hpp: raw_alloc → cudaMalloc + destructor - test: Path 2 + Path 3 fallback variant + AllocatorType propagation Test Plan: 1. 
Unit tests: torchrun --nproc_per_node=2 -m pytest test/distributed/test_symmetric_memory.py \ -k "LoweringTest" -xvs → 9 passed, 1 skipped 2. Codegen verification (torchrun --nproc_per_node=8): - no-CG: single empty_strided_p2p in header, single .copy_() in prefix, no Triton identity kernel. Correctness check passed (compiled vs eager). - CG: single empty_strided_p2p in header (no duplicate), single .copy_() inside partition_0 (CG-captured), NO .copy_() in outer call(). Correctness check passed. 3. Benchmark: devgpu088, 8×H100, torchrun --nproc_per_node=8, 200 warmup + 200 timed iters per shape. Each (shape × mode) pair in separate torchrun invocation. Results in table above. ghstack-source-id: ce9d788 Pull Request resolved: #175797

…puts" Summary: Replace the Triton identity-copy workaround for graph inputs (InputBuffer) that need P2P memory with a Layout-based allocator constraint approach. Three paths in _maybe_realize_symm_mem, in priority order: 1. We control allocation (ComputedBuffer) → CommBufferLayout (zero-copy) 2. Graph placeholder (InputBuffer, static shapes) → mark layout.allocator = AllocatorType(kind="symm_mem"); wrapper generates persistent P2P buffer + DMA .copy_() inside the CG partition. Runs on copy engine, not compute SMs. 3. Fallback: insert Triton identity copy with CommBufferLayout (original). Under CUDAGraph partition mode, only the partition wrapper (CG-recorded) emits the P2P allocation and .copy_(). The outer wrapper skips entirely so that: - .copy_() is captured inside the CG recording → ~0µs dispatch on replay - No redundant copy in outer call() → eliminates the self-copy no-op bug - Single empty_strided_p2p in header → no duplicate P2P allocation CUDAGraph fix (CUDASymmetricMemory.cu): CUDAPeerAllocInfo's constructor previously used CUDACachingAllocator::raw_alloc for the device-side pointer arrays (buffers_dev_, signal_pads_dev_). During CG tree warmup the caching allocator is redirected to a private pool, so these small, long-lived infrastructure allocations landed in the CG pool and were flagged as untracked leaks by check_memory_pool. Replaced with cudaMalloc, which always allocates from the default CUDA pool regardless of the active pool context. Also added ~CUDAPeerAllocInfo destructor (the old code never freed these allocations — a pre-existing leak). 
Benchmark (devgpu088, 8×H100, one_shot_all_reduce, 200w+200i, each shape×mode in separate torchrun to avoid alloc_id collision): | Shape | PR4 no-CG | PR5 no-CG | Δ no-CG | PR4 CG | PR5 CG | Δ CG | |-----------|-----------|-----------|-----------|--------|--------|-----------| | 8×8 | 92.9µs | 86.8µs | **−6.6%** | 89.6µs | 81.3µs | **−9.3%** | | 64×64 | 95.5µs | 86.8µs | **−9.1%** | 86.8µs | 89.9µs | +3.6% | | 256×256 | 86.3µs | 87.5µs | +1.4% | 91.3µs | 87.2µs | **−4.5%** | | 512×512 | —† | 87.4µs | — | 89.0µs | 81.9µs | **−8.0%** | | 1024×1024 | 144.5µs | 138.1µs | **−4.4%** | 154.5µs| 154.3µs| −0.1% | † 512×512 PR4 no-CG outlier excluded. Key results: - no-CG: PR5 wins 4-9% (DMA dispatch cheaper than Triton kernel launch) - CG: PR5 wins 4-9% at most shapes (.copy_() CG-captured, ~0µs dispatch; DMA runs on copy engine instead of compute SMs) - CG regression from earlier PR5 versions is fully resolved by the partition-only guard in codegen_p2p_input_copies() Key changes: - ir.py: AllocatorType frozen dataclass (kind="default"/"symm_mem", group_name) + Layout.allocator field, propagated through as_fixed()/__eq__ - comm_lowering.py: 3-path _maybe_realize_symm_mem with is_symbolic guard - wrapper.py: codegen_p2p_input_copies() — header-level P2P alloc (once) + prefix-level DMA .copy_() (every call); CG partition guard ensures only the partition wrapper emits (not the outer wrapper), so .copy_() is CG-captured and replays with ~0µs dispatch - wrapper.py: fix f-string syntax error in empty_strided_p2p codegen - wrapper.py: move proton set_allocator / start() from self.header to self.prefix — under CUDAGraph partition mode self.header is redirected into call(), but proton must execute on every call() invocation - CUDASymmetricMemory.cu/hpp: raw_alloc → cudaMalloc + destructor - test: Path 2 + Path 3 fallback variant + AllocatorType propagation Test Plan: 1. 
Unit tests: torchrun --nproc_per_node=2 -m pytest test/distributed/test_symmetric_memory.py \ -k "LoweringTest" -xvs → 9 passed, 1 skipped 2. Codegen verification (torchrun --nproc_per_node=8): - no-CG: single empty_strided_p2p in header, single .copy_() in prefix, no Triton identity kernel. Correctness check passed (compiled vs eager). - CG: single empty_strided_p2p in header (no duplicate), single .copy_() inside partition_0 (CG-captured), NO .copy_() in outer call(). Correctness check passed. 3. Benchmark: devgpu088, 8×H100, torchrun --nproc_per_node=8, 200 warmup + 200 timed iters per shape. Each (shape × mode) pair in separate torchrun invocation. Results in table above. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo [ghstack-poisoned]

Summary: Replace the Triton identity-copy workaround for graph inputs (InputBuffer) that need P2P memory with a Layout-based allocator constraint approach. Three paths in _maybe_realize_symm_mem, in priority order: 1. We control allocation (ComputedBuffer) → CommBufferLayout (zero-copy) 2. Graph placeholder (InputBuffer, static shapes) → mark layout.allocator = AllocatorType(kind="symm_mem"); wrapper generates persistent P2P buffer + DMA .copy_() inside the CG partition. Runs on copy engine, not compute SMs. 3. Fallback: insert Triton identity copy with CommBufferLayout (original). Under CUDAGraph partition mode, only the partition wrapper (CG-recorded) emits the P2P allocation and .copy_(). The outer wrapper skips entirely so that: - .copy_() is captured inside the CG recording → ~0µs dispatch on replay - No redundant copy in outer call() → eliminates the self-copy no-op bug - Single empty_strided_p2p in header → no duplicate P2P allocation CUDAGraph fix (CUDASymmetricMemory.cu): CUDAPeerAllocInfo's constructor previously used CUDACachingAllocator::raw_alloc for the device-side pointer arrays (buffers_dev_, signal_pads_dev_). During CG tree warmup the caching allocator is redirected to a private pool, so these small, long-lived infrastructure allocations landed in the CG pool and were flagged as untracked leaks by check_memory_pool. Replaced with cudaMalloc, which always allocates from the default CUDA pool regardless of the active pool context. Also added ~CUDAPeerAllocInfo destructor (the old code never freed these allocations — a pre-existing leak). 
Benchmark (devgpu088, 8×H100, one_shot_all_reduce, 200w+200i, each shape×mode in separate torchrun to avoid alloc_id collision): | Shape | PR4 no-CG | PR5 no-CG | Δ no-CG | PR4 CG | PR5 CG | Δ CG | |-----------|-----------|-----------|-----------|--------|--------|-----------| | 8×8 | 92.9µs | 86.8µs | **−6.6%** | 89.6µs | 81.3µs | **−9.3%** | | 64×64 | 95.5µs | 86.8µs | **−9.1%** | 86.8µs | 89.9µs | +3.6% | | 256×256 | 86.3µs | 87.5µs | +1.4% | 91.3µs | 87.2µs | **−4.5%** | | 512×512 | —† | 87.4µs | — | 89.0µs | 81.9µs | **−8.0%** | | 1024×1024 | 144.5µs | 138.1µs | **−4.4%** | 154.5µs| 154.3µs| −0.1% | † 512×512 PR4 no-CG outlier excluded. Key results: - no-CG: PR5 wins 4-9% (DMA dispatch cheaper than Triton kernel launch) - CG: PR5 wins 4-9% at most shapes (.copy_() CG-captured, ~0µs dispatch; DMA runs on copy engine instead of compute SMs) - CG regression from earlier PR5 versions is fully resolved by the partition-only guard in codegen_p2p_input_copies() Key changes: - ir.py: AllocatorType frozen dataclass (kind="default"/"symm_mem", group_name) + Layout.allocator field, propagated through as_fixed()/__eq__ - comm_lowering.py: 3-path _maybe_realize_symm_mem with is_symbolic guard - wrapper.py: codegen_p2p_input_copies() — header-level P2P alloc (once) + prefix-level DMA .copy_() (every call); CG partition guard ensures only the partition wrapper emits (not the outer wrapper), so .copy_() is CG-captured and replays with ~0µs dispatch - wrapper.py: fix f-string syntax error in empty_strided_p2p codegen - wrapper.py: move proton set_allocator / start() from self.header to self.prefix — under CUDAGraph partition mode self.header is redirected into call(), but proton must execute on every call() invocation - CUDASymmetricMemory.cu/hpp: raw_alloc → cudaMalloc + destructor - test: Path 2 + Path 3 fallback variant + AllocatorType propagation Test Plan: 1. 
Unit tests: torchrun --nproc_per_node=2 -m pytest test/distributed/test_symmetric_memory.py \ -k "LoweringTest" -xvs → 9 passed, 1 skipped 2. Codegen verification (torchrun --nproc_per_node=8): - no-CG: single empty_strided_p2p in header, single .copy_() in prefix, no Triton identity kernel. Correctness check passed (compiled vs eager). - CG: single empty_strided_p2p in header (no duplicate), single .copy_() inside partition_0 (CG-captured), NO .copy_() in outer call(). Correctness check passed. 3. Benchmark: devgpu088, 8×H100, torchrun --nproc_per_node=8, 200 warmup + 200 timed iters per shape. Each (shape × mode) pair in separate torchrun invocation. Results in table above. ghstack-source-id: 55610af Pull Request resolved: #175797

…puts" Summary: Replace the Triton identity-copy workaround for graph inputs (InputBuffer) that need P2P memory with a Layout-based allocator constraint approach. Three paths in _maybe_realize_symm_mem, in priority order: 1. We control allocation (ComputedBuffer) → CommBufferLayout (zero-copy) 2. Graph placeholder (InputBuffer, static shapes) → mark layout.allocator = AllocatorType(kind="symm_mem"); wrapper generates persistent P2P buffer + DMA .copy_() inside the CG partition. Runs on copy engine, not compute SMs. 3. Fallback: insert Triton identity copy with CommBufferLayout (original). Under CUDAGraph partition mode, only the partition wrapper (CG-recorded) emits the P2P allocation and .copy_(). The outer wrapper skips entirely so that: - .copy_() is captured inside the CG recording → ~0µs dispatch on replay - No redundant copy in outer call() → eliminates the self-copy no-op bug - Single empty_strided_p2p in header → no duplicate P2P allocation CUDAGraph fix (CUDASymmetricMemory.cu): CUDAPeerAllocInfo's constructor previously used CUDACachingAllocator::raw_alloc for the device-side pointer arrays (buffers_dev_, signal_pads_dev_). During CG tree warmup the caching allocator is redirected to a private pool, so these small, long-lived infrastructure allocations landed in the CG pool and were flagged as untracked leaks by check_memory_pool. Replaced with cudaMalloc, which always allocates from the default CUDA pool regardless of the active pool context. Also added ~CUDAPeerAllocInfo destructor (the old code never freed these allocations — a pre-existing leak). 
Benchmark (devgpu088, 8×H100, one_shot_all_reduce, 200 warmup + 200 timed iterations, each shape×mode in separate torchrun to avoid alloc_id collision):

| Shape     | PR4 no-CG | PR5 no-CG | Δ no-CG   | PR4 CG  | PR5 CG  | Δ CG      |
|-----------|-----------|-----------|-----------|---------|---------|-----------|
| 8×8       | 92.9µs    | 86.8µs    | **−6.6%** | 89.6µs  | 81.3µs  | **−9.3%** |
| 64×64     | 95.5µs    | 86.8µs    | **−9.1%** | 86.8µs  | 89.9µs  | +3.6%     |
| 256×256   | 86.3µs    | 87.5µs    | +1.4%     | 91.3µs  | 87.2µs  | **−4.5%** |
| 512×512   | n/a†      | 87.4µs    | n/a       | 89.0µs  | 81.9µs  | **−8.0%** |
| 1024×1024 | 144.5µs   | 138.1µs   | **−4.4%** | 154.5µs | 154.3µs | −0.1%     |

† 512×512 PR4 no-CG outlier excluded.

Key results:

- no-CG: PR5 wins 4-9% at three of the four comparable shapes; 256×256 is +1.4% (DMA dispatch cheaper than Triton kernel launch)
- CG: PR5 wins 4-9% at three of the five shapes; 64×64 is +3.6% and 1024×1024 is flat (.copy_() CG-captured, ~0µs dispatch; DMA runs on copy engine instead of compute SMs)
- CG regression from earlier PR5 versions is fully resolved by the partition-only guard in codegen_p2p_input_copies()

Key changes:

- ir.py: AllocatorType frozen dataclass (kind="default"/"symm_mem", group_name) + Layout.allocator field, propagated through as_fixed()/__eq__
- comm_lowering.py: 3-path _maybe_realize_symm_mem with is_symbolic guard
- wrapper.py: codegen_p2p_input_copies(): header-level P2P alloc (once) + prefix-level DMA .copy_() (every call); CG partition guard ensures only the partition wrapper emits (not the outer wrapper), so .copy_() is CG-captured and replays with ~0µs dispatch
- wrapper.py: fix f-string syntax error in empty_strided_p2p codegen
- wrapper.py: move proton set_allocator / start() from self.header to self.prefix; under CUDAGraph partition mode self.header is redirected into call(), but proton must execute on every call() invocation
- CUDASymmetricMemory.cu/hpp: raw_alloc → cudaMalloc + destructor
- test: Path 2 + Path 3 fallback variant + AllocatorType propagation

Test Plan:

1. 
Unit tests: torchrun --nproc_per_node=2 -m pytest test/distributed/test_symmetric_memory.py -k "LoweringTest" -xvs → 9 passed, 1 skipped
2. Codegen verification (torchrun --nproc_per_node=8):
   - no-CG: single empty_strided_p2p in header, single .copy_() in prefix, no Triton identity kernel. Correctness check passed (compiled vs eager).
   - CG: single empty_strided_p2p in header (no duplicate), single .copy_() inside partition_0 (CG-captured), NO .copy_() in outer call(). Correctness check passed.
3. Benchmark: devgpu088, 8×H100, torchrun --nproc_per_node=8, 200 warmup + 200 timed iters per shape. Each (shape × mode) pair in separate torchrun invocation. Results in table above.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo

[ghstack-poisoned]
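The "AllocatorType frozen dataclass + Layout.allocator field, propagated through as_fixed()/__eq__" item can be pictured with a toy model. The sketch below only assumes what the commit message states (fields `kind` and `group_name`, a frozen dataclass, propagation through `as_fixed()` and equality). `ToyLayout` and the group name `"group0"` are hypothetical and are not Inductor's real `Layout` class or a real process-group name.

```python
from dataclasses import dataclass, field, FrozenInstanceError
from typing import Optional, Tuple


@dataclass(frozen=True)
class AllocatorType:
    """Allocator constraint attached to a layout; frozen, so hashable and immutable."""
    kind: str = "default"              # "default" or "symm_mem"
    group_name: Optional[str] = None   # process group for symm_mem buffers


@dataclass
class ToyLayout:
    """Hypothetical, much-simplified layout carrying an allocator constraint."""
    size: Tuple[int, ...]
    stride: Tuple[int, ...]
    allocator: AllocatorType = field(default_factory=AllocatorType)

    def as_fixed(self) -> "ToyLayout":
        # Propagate the allocator so the constraint survives layout conversion.
        return ToyLayout(self.size, self.stride, allocator=self.allocator)


if __name__ == "__main__":
    plain = ToyLayout((8, 8), (8, 1))
    symm = ToyLayout((8, 8), (8, 1), AllocatorType("symm_mem", "group0"))
    print(symm.as_fixed().allocator)   # constraint is preserved
    print(plain == symm)               # layouts differ only by allocator
    try:
        symm.allocator.kind = "default"  # frozen: mutation is rejected
    except FrozenInstanceError:
        print("AllocatorType is immutable")
```

Because the dataclass-generated `__eq__` compares every field, two layouts with identical size and stride but different allocators compare unequal, which is the property the propagation test in the PR appears to target.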

ghstack-source-id: b713bf8
Pull Request resolved: #175797


ghstack-source-id: fdfb366
Pull Request resolved: #175797


ghstack-source-id: a157c2e
Pull Request resolved: #175797


ghstack-source-id: cb40fc4
Pull Request resolved: #175797

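The "Key changes" entry for ir.py describes AllocatorType as a frozen dataclass carried on the layout and propagated through as_fixed()/__eq__. A minimal sketch of that idea follows, assuming a toy layout class: `ToyLayout` and the group name `"0"` are hypothetical stand-ins, and none of this is the actual Inductor code.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class AllocatorType:
    """Hypothetical stand-in: which allocator must back a buffer."""

    kind: str = "default"  # "default" or "symm_mem"
    group_name: Optional[str] = None  # process group, relevant for symm_mem


@dataclass
class ToyLayout:
    """Toy layout, not Inductor's Layout: only the fields this demo needs."""

    size: Tuple[int, ...]
    stride: Tuple[int, ...]
    allocator: AllocatorType = AllocatorType()

    def as_fixed(self) -> "ToyLayout":
        # Copy every field so the allocator constraint survives the conversion.
        return replace(self)


plain = ToyLayout(size=(8, 8), stride=(8, 1))
p2p = ToyLayout(size=(8, 8), stride=(8, 1))
# Mark the layout as needing symmetric memory ("0" is a made-up group name).
p2p.allocator = AllocatorType(kind="symm_mem", group_name="0")

fixed = p2p.as_fixed()

# The dataclass-generated __eq__ compares the allocator field too, so two
# layouts with the same shape but different allocators are not equal.
layouts_equal = plain == p2p

# The constraint object itself is immutable.
try:
    p2p.allocator.kind = "default"
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Making the constraint frozen keeps it hashable and safe to share between layouts, while the layout itself stays mutable so a lowering pass can attach the constraint after the fact.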
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo

ghstack-source-id: a9103f3
Pull Request resolved: #175797


ghstack-source-id: afa568e
Pull Request resolved: #175797


ghstack-source-id: e7edc4b
Pull Request resolved: #175797


Summary: Replace the Triton identity-copy workaround for graph inputs (InputBuffer) that need P2P memory with a Layout-based allocator constraint approach. Three paths in _maybe_realize_symm_mem, in priority order: 1. We control allocation (ComputedBuffer) → CommBufferLayout (zero-copy) 2. Graph placeholder (InputBuffer, static shapes) → mark layout.allocator = AllocatorType(kind="symm_mem"); wrapper generates persistent P2P buffer + DMA .copy_() inside the CG partition. Runs on copy engine, not compute SMs. 3. Fallback: insert Triton identity copy with CommBufferLayout (original). Under CUDAGraph partition mode, only the partition wrapper (CG-recorded) emits the P2P allocation and .copy_(). The outer wrapper skips entirely so that: - .copy_() is captured inside the CG recording → ~0µs dispatch on replay - No redundant copy in outer call() → eliminates the self-copy no-op bug - Single empty_strided_p2p in header → no duplicate P2P allocation CUDAGraph fix (CUDASymmetricMemory.cu): CUDAPeerAllocInfo's constructor previously used CUDACachingAllocator::raw_alloc for the device-side pointer arrays (buffers_dev_, signal_pads_dev_). During CG tree warmup the caching allocator is redirected to a private pool, so these small, long-lived infrastructure allocations landed in the CG pool and were flagged as untracked leaks by check_memory_pool. Replaced with cudaMalloc, which always allocates from the default CUDA pool regardless of the active pool context. Also added ~CUDAPeerAllocInfo destructor (the old code never freed these allocations — a pre-existing leak). 
Benchmark (devgpu088, 8×H100, one_shot_all_reduce, 200 warmup + 200 timed iterations, each shape×mode in a separate torchrun to avoid alloc_id collision):

| Shape     | PR4 no-CG | PR5 no-CG | Δ no-CG   | PR4 CG  | PR5 CG  | Δ CG      |
|-----------|-----------|-----------|-----------|---------|---------|-----------|
| 8×8       | 92.9µs    | 86.8µs    | **−6.6%** | 89.6µs  | 81.3µs  | **−9.3%** |
| 64×64     | 95.5µs    | 86.8µs    | **−9.1%** | 86.8µs  | 89.9µs  | +3.6%     |
| 256×256   | 86.3µs    | 87.5µs    | +1.4%     | 91.3µs  | 87.2µs  | **−4.5%** |
| 512×512   | n/a†      | 87.4µs    | n/a       | 89.0µs  | 81.9µs  | **−8.0%** |
| 1024×1024 | 144.5µs   | 138.1µs   | **−4.4%** | 154.5µs | 154.3µs | −0.1%     |

† 512×512 PR4 no-CG outlier excluded.

Key results:

- no-CG: PR5 is 4-9% faster at 8×8, 64×64, and 1024×1024 (DMA dispatch is cheaper than a Triton kernel launch); 256×256 is 1.4% slower, and 512×512 has no PR4 baseline.
- CG: PR5 is 4-9% faster at 8×8, 256×256, and 512×512 (.copy_() is CG-captured, ~0µs dispatch; DMA runs on the copy engine instead of compute SMs); 1024×1024 is flat and 64×64 is 3.6% slower.
- CG regression from earlier PR5 versions is fully resolved by the partition-only guard in codegen_p2p_input_copies().

Key changes:

- ir.py: AllocatorType frozen dataclass (kind="default"/"symm_mem", group_name) + Layout.allocator field, propagated through as_fixed()/__eq__
- comm_lowering.py: 3-path _maybe_realize_symm_mem with is_symbolic guard
- wrapper.py: codegen_p2p_input_copies(): header-level P2P alloc (once) + prefix-level DMA .copy_() (every call); the CG partition guard ensures that only the partition wrapper emits them (not the outer wrapper), so .copy_() is CG-captured and replays with ~0µs dispatch
- wrapper.py: fix f-string syntax error in empty_strided_p2p codegen
- wrapper.py: move proton set_allocator / start() from self.header to self.prefix; under CUDAGraph partition mode self.header is redirected into call(), but proton must execute on every call() invocation
- CUDASymmetricMemory.cu/hpp: raw_alloc → cudaMalloc + destructor
- test: Path 2 + Path 3 fallback variant + AllocatorType propagation

Test Plan:

1. Unit tests:
   torchrun --nproc_per_node=2 -m pytest test/distributed/test_symmetric_memory.py \
     -k "LoweringTest" -xvs
   → 9 passed, 1 skipped
2. Codegen verification (torchrun --nproc_per_node=8):
   - no-CG: single empty_strided_p2p in header, single .copy_() in prefix, no Triton identity kernel. Correctness check passed (compiled vs eager).
   - CG: single empty_strided_p2p in header (no duplicate), single .copy_() inside partition_0 (CG-captured), no .copy_() in outer call(). Correctness check passed.
3. Benchmark: devgpu088, 8×H100, torchrun --nproc_per_node=8, 200 warmup + 200 timed iters per shape. Each (shape × mode) pair in a separate torchrun invocation. Results in the table above.

ghstack-source-id: b7901d2
Pull Request resolved: #175797
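The ir.py change listed under Key changes (a frozen AllocatorType carried on the layout and propagated through as_fixed()/__eq__) can be illustrated with a minimal sketch. `ToyLayout` and `ToyFixedLayout` are hypothetical stand-ins, and the group name is a placeholder; only the `kind` and `group_name` fields come from the description above. The real Inductor Layout classes carry more state and define their own equality.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AllocatorType:
    """Allocator constraint attached to a layout (fields per the PR text)."""
    kind: str = "default"              # "default" or "symm_mem"
    group_name: Optional[str] = None   # process group, for symm_mem only


@dataclass(frozen=True)
class ToyFixedLayout:
    """Hypothetical fixed layout; dataclass __eq__ compares all fields."""
    size: Tuple[int, ...]
    stride: Tuple[int, ...]
    allocator: AllocatorType = AllocatorType()


@dataclass(frozen=True)
class ToyLayout:
    """Hypothetical flexible layout that can be frozen via as_fixed()."""
    size: Tuple[int, ...]
    stride: Tuple[int, ...]
    allocator: AllocatorType = AllocatorType()

    def as_fixed(self) -> ToyFixedLayout:
        # Carry the allocator constraint along so it is not silently dropped.
        return ToyFixedLayout(self.size, self.stride, self.allocator)


# Usage: a symm_mem constraint survives as_fixed() and affects equality.
symm = AllocatorType(kind="symm_mem", group_name="placeholder_group")
fixed = ToyLayout((8, 8), (8, 1), symm).as_fixed()
plain = ToyLayout((8, 8), (8, 1)).as_fixed()
```

Making the allocator part of equality means two layouts with the same shape and strides but different allocators are not treated as interchangeable, which is the property the "AllocatorType propagation" test item presumably checks.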

tianrengao mentioned this pull request Feb 25, 2026

[inductor] Lower functional symm_mem ops to ExternKernelOut for output buffer reuse #174856

Closed

tianrengao mentioned this pull request Feb 25, 2026

[inductor] symm_mem planning for graph inputs and fallback regions #175449

Closed

pytorch-bot bot added ciflow/h100-symm-mem ciflow/inductor module: inductor release notes: distributed (c10d) release notes category labels Feb 25, 2026

This was referenced Feb 25, 2026

[inductor] CUDAGraph P2P pool handling for symm_mem #175450

Open

[inductor] Hoist output buffer allocations into prior CUDAGraph partition #175476

Open

tianrengao marked this pull request as ready for review February 25, 2026 22:54

tianrengao requested review from kwen2501 February 25, 2026 23:24

pytorch-bot bot added the ciflow/torchtitan Run TorchTitan integration tests label Mar 9, 2026

tianrengao marked this pull request as draft March 20, 2026 20:33

tianrengao wants to merge 12 commits into gh/tianrengao/26/base from gh/tianrengao/26/head

pytorch-bot bot commented Feb 25, 2026 (edited)

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175797

✅ You can merge normally! (1 Unrelated Failure)
