
Commit 8292e2e

Update on "[inductor] CUDAGraph P2P pool handling for symm_mem"
Summary: When symm_mem P2P tensors (allocated via `empty_strided_p2p` with `alloc_id`) are inputs to a CUDAGraph partition, the cudagraph tree must handle them specially:

1. `p2p_input_idxs`: detected during node initialization via a `_has_Standard_Deleter` check and added to `static_input_idxs`, so they are passed through without being copied into the cudagraph pool (which would lose the P2P property) and their pointer stability is validated on replay.
2. `check_memory_pool`: filters out P2P allocations (non-standard deleter) before validating against the cudagraph pool, since P2P buffers use cuMemCreate/cuMemMap and are not managed by the CUDA caching allocator.
3. `dealloc_current_path_weakrefs`: skips the standard-deleter assertion for P2P storage wrappers.
4. `test_external_allocation_fallback` updated: now expects success (automatic copy to P2P) instead of a RuntimeError, with codegen and runtime correctness checks.

Differential Revision: https://phabricator.intern.facebook.com/D93914969

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo mlazos

[ghstack-poisoned]
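The static-input selection in point 1 can be illustrated without CUDA. The sketch below is hypothetical: `FakeTensor` and `has_standard_deleter` are stand-ins for real tensors and `torch._C._has_Standard_Deleter`, used only to make the filtering logic concrete and testable.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Stand-in for torch.Tensor; only the fields the filter reads."""
    is_cuda: bool
    standard_deleter: bool  # True -> owned by the CUDA caching allocator


def has_standard_deleter(t: FakeTensor) -> bool:
    # Stand-in for torch._C._has_Standard_Deleter(t.untyped_storage()._cdata)
    return t.standard_deleter


def select_p2p_input_idxs(inputs):
    """Indices of CUDA inputs whose storage is NOT from the caching
    allocator (e.g. symm_mem P2P buffers). These become static inputs:
    passed through by pointer, never copied into the cudagraph pool."""
    return [
        idx
        for idx, t in enumerate(inputs)
        if isinstance(t, FakeTensor)
        and t.is_cuda
        and not has_standard_deleter(t)
    ]


inputs = [
    FakeTensor(is_cuda=True, standard_deleter=True),   # normal CUDA tensor
    FakeTensor(is_cuda=True, standard_deleter=False),  # P2P symm_mem buffer
    FakeTensor(is_cuda=False, standard_deleter=True),  # CPU tensor
]
print(select_p2p_input_idxs(inputs))  # [1]
```

Only index 1 qualifies: it is both CUDA and externally allocated, mirroring the comprehension in `CUDAGraphNode.__init__`.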
2 parents 8431109 + 06e741f commit 8292e2e

3 files changed

Lines changed: 25 additions & 22 deletions

File tree

test/distributed/test_symmetric_memory.py

Lines changed: 3 additions & 7 deletions
```diff
@@ -1593,14 +1593,10 @@ def func(x, w):
     @skip_if_rocm_multiprocess  # requires registered-buffer support
     @skip_if_lt_x_gpu(2)
     @fresh_inductor_cache()
-    def test_cudagraph_p2p_input_passthrough(self):
+    def test_one_shot_all_reduce_with_cudagraph(self):
         """
-        Verify that when a symm_mem collective's input is a cudagraph-managed
-        tensor from a prior compiled graph, the P2P tensor is correctly passed
-        through the cudagraph tree without being copied to the regular pool.
-
-        This tests the p2p_input_idxs mechanism in CUDAGraphNode that adds P2P
-        inputs to static_input_idxs so they are not re-allocated.
+        Verify one_shot_all_reduce correctness under CUDAGraph
+        record + replay (mode="reduce-overhead").
         """
         self._init_process()
```
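The replay-time pointer-stability validation mentioned in the summary can be sketched in plain Python. This is a hedged analogue, not the real `CUDAGraphNode` code: object identity (`id`) stands in for `tensor.data_ptr()`, and the class name is invented for illustration.

```python
class GraphNodeSketch:
    """Records the addresses of static inputs at capture time and asserts
    on every replay that they have not moved (a cudagraph bakes raw
    pointers into the captured kernels, so static inputs must be stable)."""

    def __init__(self, inputs, static_input_idxs):
        self.static_input_idxs = static_input_idxs
        # id() models tensor.data_ptr() for this sketch
        self.recorded_ptrs = {i: id(inputs[i]) for i in static_input_idxs}

    def check_static_inputs_stable(self, inputs):
        for i in self.static_input_idxs:
            if id(inputs[i]) != self.recorded_ptrs[i]:
                raise RuntimeError(
                    f"static input {i} moved; cudagraph replay "
                    "requires a stable address"
                )


buf = bytearray(16)  # stands in for a P2P buffer with a fixed address
node = GraphNodeSketch([buf], static_input_idxs=[0])
node.check_static_inputs_stable([buf])  # ok: same address as at record time
try:
    node.check_static_inputs_stable([bytearray(16)])  # different address
except RuntimeError as e:
    print("caught:", e)
```

Because P2P buffers keep their cuMemMap'd address for their whole lifetime, this check is expected to pass on every replay for `p2p_input_idxs` entries.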

torch/_inductor/cudagraph_trees.py

Lines changed: 22 additions & 14 deletions
```diff
@@ -921,19 +921,15 @@ def __init__(
             if isinstance(t, torch.Tensor) and self._is_cuda_graph_recorded_tensor(t)
         ]

-        # P2P symmetric memory inputs (allocated via empty_strided_p2p with
-        # alloc_id). These have stable addresses and are passed through
-        # without copying into the cudagraph pool.
-        #
-        # Detection: P2P buffers are allocated via cuMemCreate/cuMemMap (not
-        # the CUDA caching allocator), so they lack a standard deleter.
-        # TODO: Replace with a positive is_p2p check on StorageImpl (requires C++ change).
+        # P2P symmetric memory inputs are not from the caching allocator.
+        # They are allocated via empty_strided_p2p and have stable addresses.
+        # Add them to static_input_idxs to prevent re-allocation from the cudagraph pool.
         self.p2p_input_idxs: list[int] = [
             idx
             for idx, t in enumerate(inputs)
             if isinstance(t, torch.Tensor)
             and t.is_cuda
-            and not torch._C._has_Standard_Deleter(t.untyped_storage()._cdata)
+            and _is_external_storage(t.untyped_storage()._cdata)
         ]

         # (depth, offset) of live tensors which are alias of previous graph outputs
@@ -1872,6 +1868,17 @@ def format_tb(frames: list[Any]) -> str:
     return "".join(traceback.format_list(formatted_traceback))


+def _is_external_storage(storage_cdata: int) -> bool:
+    """Check if a storage is not allocated by the CUDA caching allocator.
+    In the cudagraph tree, all standard CUDA tensors use the caching
+    allocator's deleter (raw_deleter). External allocations such as P2P
+    symmetric memory (via cuMemCreate) use a different deleter, so we
+    can distinguish them with _has_Standard_Deleter.
+    TODO: add a positive is_p2p / is_external flag on StorageImpl so we
+    don't rely on deleter identity."""
+    return not torch._C._has_Standard_Deleter(storage_cdata)
+
+
 def check_memory_pool(
     device: int,
     pool_id: tuple[int, int],
@@ -1884,10 +1891,10 @@ def check_memory_pool(
         storage_ptr = stor()
         if storage_ptr is None:
             continue
-        # Skip non-pool allocations (e.g., P2P symmetric memory buffers allocated
-        # via cuMemCreate/cuMemMap). These are not managed by the CUDA caching
-        # allocator and should not be validated against the cudagraph pool.
-        if not torch._C._has_Standard_Deleter(storage_ptr):
+        # Skip non-pool allocations, for example, P2P symmetric memory buffers allocated via cuMemCreate.
+        # They are not managed by the CUDA caching allocator and should not be validated
+        # against the cudagraph pool.
+        if _is_external_storage(storage_ptr):
             continue
         unique_storages.add(stor.data_ptr())

@@ -2731,8 +2738,9 @@ def apply_checkpoint_execution_state_in_allocator(self) -> None:
         for wrapper in live_storages_wrappers:
             storage_ptr = wrapper()
             assert storage_ptr is not None
-            # Skip non-pool allocations (e.g., P2P symmetric memory buffers)
-            if torch._C._has_Standard_Deleter(storage_ptr):
+            # P2P storages are not in the cudagraph pool, so
+            # skip the deallocation check for them.
+            if not _is_external_storage(storage_ptr):
                 assert wrapper.data_ptr() not in ptrs_to_deallocate

     def live_cudagraph_pool_storages_in_curr_execution(
```
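The `check_memory_pool` filtering can be modeled with plain pointers. In this hedged sketch the pool is a set of integers and the deleter check is a set-membership test (the real code calls `torch._C._has_Standard_Deleter` on the storage); all names and addresses are invented for illustration.

```python
def is_external_storage(ptr, standard_deleter_ptrs):
    """External == not owned by the caching allocator's raw_deleter."""
    return ptr not in standard_deleter_ptrs


def validate_pool(live_ptrs, pool_ptrs, standard_deleter_ptrs):
    """Every live caching-allocator storage must belong to the cudagraph
    pool; external storages (e.g. P2P symm_mem buffers) are skipped, since
    they are cuMemCreate'd outside the pool by design."""
    checked = set()
    for ptr in live_ptrs:
        if is_external_storage(ptr, standard_deleter_ptrs):
            continue  # P2P buffer: lives outside the pool, do not validate
        if ptr not in pool_ptrs:
            raise AssertionError(f"storage {ptr:#x} not in cudagraph pool")
        checked.add(ptr)
    return checked


pool = {0x1000, 0x2000}          # addresses owned by the cudagraph pool
standard = {0x1000, 0x2000}      # allocated by the caching allocator
live = [0x1000, 0x9000]          # 0x9000 is an external P2P buffer
print(sorted(validate_pool(live, pool, standard)))  # [4096]
```

Without the skip, the external pointer `0x9000` would fail the pool check even though it is a perfectly valid P2P allocation, which is exactly the failure mode the patch removes.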

torch/csrc/distributed/c10d/symm_mem/CUDASymmetricMemory.cu

Lines changed: 0 additions & 1 deletion
```diff
@@ -111,7 +111,6 @@ CUDAPeerAllocInfo::~CUDAPeerAllocInfo() {
   if (is_finalizing()) {
     return;
   }
-  // Best-effort free -- ignore errors during process teardown.
   c10::cuda::CUDAGuard guard(local_device_idx_);
   (void)cudaFree(buffers_dev_);
   (void)cudaFree(signal_pads_dev_);
```
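The destructor above follows a common teardown pattern: bail out entirely if the process is finalizing, otherwise free best-effort and ignore errors. A hedged Python analogue (class and callback names are invented; `sys.is_finalizing` mirrors the C++ `is_finalizing()` guard):

```python
import sys


class PeerAllocInfoSketch:
    """Illustrative analogue of CUDAPeerAllocInfo's destructor behavior."""

    def __init__(self, free_fn):
        self._free = free_fn  # stands in for cudaFree of the device arrays

    def __del__(self):
        if sys.is_finalizing():
            # During interpreter teardown the driver/runtime may already be
            # gone; skip cleanup entirely, as the C++ dtor does.
            return
        try:
            self._free()  # best-effort free: errors are swallowed
        except Exception:
            pass


calls = []
info = PeerAllocInfoSketch(lambda: calls.append("free"))
del info  # refcount drops -> __del__ runs immediately in CPython
print(calls)  # ['free']
```

The `(void)cudaFree(...)` casts in the C++ code serve the same purpose as the `try/except` here: the return code is deliberately discarded.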
