Refactor allreduce for supporting prefill case #2453
TennyWang1223 wants to merge 26 commits into main
Conversation
Support aiter tensor. The C++ interface previously took raw pointers for input and output; it now takes aiter tensors as parameters. Class pointers and IPCHandle pointers remain unchanged.
MI300 test result
MI308 test result
It looks like medium-sized cases still need optimization on gfx942.
move "torch.tesnor -> pybind aiter_tesnor_t" to dtypes.py |
[Kernel][Perf] Make allreduce fusion kernels support arbitrary hidden_dim

Previously the fused allreduce+rmsnorm+quant kernels only supported N=512/1024/2048/4096 via compile-time template dispatch. This made models with other hidden_dim (e.g. GLM-5 N=6144, GPT-OSS N=2880) fall back to the slower non-fused path.

Changes:
- Convert HIDDEN_DIM/BLOCK_SIZE from template parameter to runtime parameter in 1stage/2stage/split fusion kernels
- Use __launch_bounds__(1024,1) with runtime thread count
- Fix block_reduce for non-power-of-2 warp counts (round up reduce_width for shfl_xor correctness)
- Pad 1stage launch threads to WARP_SIZE multiples with active guard
- Use dynamic shared memory for 2stage kernel
- Generalize step2 dispatch (local_device_load_rmsnorm) to support any N where n_packs >= 64, removing n_bytes%1024 alignment requirement
- Replace silent printf errors with throw for unsupported shapes
- Add AITER_AR_1STAGE env override for benchmarking
- Improve test_fused_ar_rms.py: add error column, --test flag, multi-shape support, markdown summary table

Now supports any N that satisfies: N % pack_size == 0 and N / pack_size <= 1024 (i.e. N <= 8192 for bf16).
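A minimal sketch of the shape constraint stated above: N must divide evenly into vectorized packs and fit one pack per thread in a 1024-thread block. The 16-byte pack width is an assumption inferred from the bf16 example (N <= 8192 with 2-byte elements); the kernel's actual pack size may be defined differently.

```python
import torch

PACK_BYTES = 16    # assumed vector width handled per thread
MAX_THREADS = 1024 # matches __launch_bounds__(1024, 1)

def fused_ar_rms_supported(hidden_dim: int, dtype: torch.dtype) -> bool:
    pack_size = PACK_BYTES // torch.tensor([], dtype=dtype).element_size()
    return hidden_dim % pack_size == 0 and hidden_dim // pack_size <= MAX_THREADS

# e.g. bf16: pack_size = 8, so hidden_dims like 2880 and 6144 pass, up to 8192.
```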
* fea(ar): refactor custom allreduce
* fea: support prefill
* add latency cmp with rccl
* fix: remove ck in new kernel
* fix: ruff check
* fix: test script format
* fix: ruff check
* fix: pa_metadata macro err
* fea(car): support aiter tensor
* fix: move pybind aiter tensor to dtypes.py
* add aiter_tensor_module
* update
* update
* update
* update
* update
* update
* fix: fused_ar_rms gpt n=2880 case
* [Kernel][Perf] Make allreduce fusion kernels support arbitrary hidden_dim (see the full commit message above)
* fix: add param support_prefill in ar
* fix: test_fused_ar_rms.py
* fix: test_fused_ar_rms.py
---------
Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: amd-ruitang3 <rui.tang2@amd.com>
Co-authored-by: amd-ruitang3 <145657428+amd-ruitang3@users.noreply.github.com>
…used AR+RMSNorm
- parallel_state.py: Remove hardcoded hidden_dim allowlist {512,1024,2048,4096}
for 1-stage kernel selection; keep 128KB byte threshold. AITER's C++ dispatch
already gates which dims are supported (ROCm/aiter#2453).
- benchmark_fused_ar_rms_amd.py: Add hidden_dim=2880 (GPT-OSS) to default
decode and prefill shapes.
- test_aiter_allreduce_fusion_amd.py: Add multi-hidden-dim correctness test
covering 2880/4096/5120/6144/7168/8192, and bit-exact residual accuracy
regression test for ROCm/aiter#2586.
- Add PR documentation with A/B test results (GSM8K +2.3pp, TPOT -3.7%).
Made-with: Cursor
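A hedged sketch of the selection change described in the first bullet above: drop the hidden_dim allowlist and keep only the 128KB message-size threshold, letting AITER's own C++ dispatch reject unsupported dims. The function and constant names here are illustrative, not the actual identifiers in parallel_state.py.

```python
import torch

MAX_1STAGE_BYTES = 128 * 1024  # 128KB byte threshold retained from the old logic

def use_one_stage_allreduce(inp: torch.Tensor) -> bool:
    # Before: required inp.shape[-1] in {512, 1024, 2048, 4096} plus the size check.
    # After: size check only; unsupported hidden_dims raise inside AITER (ROCm/aiter#2453).
    return inp.numel() * inp.element_size() <= MAX_1STAGE_BYTES
```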
Can we get this PR merged in? @TennyWang1223 cc @zufayu
Hi @TennyWang1223 — sgl-project/sglang#23580 reports a HIP graph capture invalidation involving the AITER custom allreduce. Could you confirm whether this PR's refactor addresses that case? Tracking issue with full context: #2941 (target v0.1.14). Without this PR (or an equivalent fix), AITER allreduce stays disabled in SGLang production via PR sgl-project/sglang#23581. Thanks!
This PR has already been merged into main. Due to a GitHub bug, it still shows as unmerged here. The AITER code in use when SGLang reported the bug should therefore already include the changes from this PR, so merging it again wouldn't resolve the issue. I'll manually close this PR later. As for the bug sgl-project/sglang#23580, I'll investigate the root cause now.
Hi @TennyWang1223 — a small follow-up to confirm intent. The PR shows as closed without merge in both the GitHub UI and the API, and the branch state doesn't make the outcome obvious. Two possibilities: the changes actually landed on main under a different commit (e.g. a squash merge) and the closed state is just a display issue, or the PR was intentionally closed without merging.
Either way is fine, I just want to make sure downstream consumers reading the closed PR get the right signal. Thanks!
The PR was actually merged via squash-merge as commit 8cfe5e281 ("Refactor allreduce for supporting prefill case (#2453)"), authored on 2026-04-01. The closed-without-merge state in the PR card appears to be a GitHub display bug. Verification (terminal output attached): $ git fetch origin
$ git log origin/main --oneline | grep "#2453"
$ git merge-base --is-ancestor 8cfe5e281 origin/main && echo "on main"
See also: 8cfe5e281 — the main branch tag is shown on the commit page.

Motivation
Refactor the custom allreduce implementation to decouple its C++ layer from PyTorch and its Python-side IPC exchange from RCCL/gloo, making the module more portable and self-contained. Additionally, increase the max buffer size to support prefill workloads with larger tensors.
Technical Details
1. IPC buffer management refactoring

Introduce `IPCBuffer` and `IPCBufferPool` classes to encapsulate the IPC buffer lifecycle. `IPCBuffer` abstracts over two allocation modes — uncached (`hipMalloc`) for synchronization metadata and cached (`torch.empty`) for D2D relay. `IPCBufferPool` manages named buffers and provides IPC handle exchange for both eager mode (pre-registered buffers) and graph mode (dynamically captured addresses).
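A minimal sketch of how the two classes could fit together. Method names and the `alloc_uncached` hook are simplified assumptions for illustration; the real implementation lives in `aiter/dist/device_communicators/custom_all_reduce.py`.

```python
import torch

class IPCBuffer:
    def __init__(self, nbytes: int, cached: bool, alloc_uncached=None):
        if cached:
            # cached mode: ordinary torch allocation, used for D2D relay data
            self.tensor = torch.empty(nbytes, dtype=torch.uint8, device="cuda")
            self.ptr = self.tensor.data_ptr()
        else:
            # uncached mode: synchronization metadata; the real code allocates this
            # with hipMalloc via the C++ extension (alloc_uncached stands in for it)
            self.ptr = alloc_uncached(nbytes)

class IPCBufferPool:
    def __init__(self):
        self.buffers: dict[str, IPCBuffer] = {}  # named buffers, looked up by role

    def register(self, name: str, buf: IPCBuffer):
        self.buffers[name] = buf

    def exchange_handles(self, store, rank: int, world_size: int):
        # export this rank's handles, then gather everyone else's (see item 3 below)
        ...
```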
2. Decouple C++ layer from `torch::Tensor`

All C++ interfaces in `custom_all_reduce.cu`, `.cuh`, and `.h` are changed from `torch::Tensor` parameters/return values to raw pointers (`int64_t`/`void*`), element counts, dtype codes, and an explicit `hipStream_t`. The C++ code now compiles without linking `libtorch`. The Python layer extracts primitives via `tensor.data_ptr()`, `tensor.numel()`, `tensor.dtype`, and `torch.cuda.current_stream().cuda_stream` before calling into C++. The `_is_weak_contiguous` check is also moved to the Python side.
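A sketch of the Python-side call path after the refactor: primitives are pulled out of the tensor before crossing into the torch-free C++ layer. The extension entry point name `all_reduce_raw` and the dtype encoding are placeholders, not the actual binding.

```python
import torch

def _is_weak_contiguous(t: torch.Tensor) -> bool:
    # contiguous, or a dense view whose storage tail exactly covers the data
    return t.is_contiguous() or (
        t.untyped_storage().nbytes() - t.storage_offset() * t.element_size()
        == t.numel() * t.element_size()
    )

def all_reduce(module, inp: torch.Tensor, out: torch.Tensor):
    assert _is_weak_contiguous(inp)  # check now lives on the Python side
    module.all_reduce_raw(           # placeholder name for the raw-pointer binding
        inp.data_ptr(), out.data_ptr(),
        inp.numel(), str(inp.dtype),
        torch.cuda.current_stream().cuda_stream,
    )
```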
3. Replace RCCL/gloo-based IPC handle broadcast with TCP store

`IPCBufferPool._gather_ipc_meta` now uses `torch.distributed.TCPStore` `set`/`get` (a pure-TCP key-value store) instead of `dist.broadcast_object_list` (which routes through the gloo collective backend). An assertion verifies the underlying store is a `TCPStore`, ensuring no collective communication backend is involved. `store.get()` blocks until the key is available, providing natural barrier semantics.
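A sketch of a TCPStore-based exchange as described above: each rank publishes its serialized handle under a rank-keyed name and then reads every peer's key; since `get()` blocks until the key exists, no extra barrier is needed. The key naming and hex serialization are illustrative assumptions.

```python
import torch.distributed as dist

def gather_ipc_meta(store, rank: int, world_size: int, my_handle: bytes) -> list:
    # the refactor asserts the store really is a TCPStore, not a collective backend
    assert isinstance(store, dist.TCPStore)
    # serialize the raw handle to a hex string before storing it
    store.set(f"aiter_ipc_handle_{rank}", my_handle.hex())
    # get() blocks until each peer has published its key, giving barrier semantics
    return [bytes.fromhex(store.get(f"aiter_ipc_handle_{r}").decode())
            for r in range(world_size)]
```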
4. Increase `max_size` to support prefill

`max_size` is raised from 128 MB to 1 GB to accommodate prefill-stage tensor sizes.

Files changed (8 files, +1042 / -691):
- `csrc/kernels/custom_all_reduce.cu` — full rewrite, torch-free implementation
- `csrc/include/custom_all_reduce.h` — raw pointer interfaces
- `csrc/include/custom_all_reduce.cuh` — remove transitive torch dependency
- `csrc/include/rocm_ops.hpp` — update pybind macro signatures
- `csrc/pybind/custom_all_reduce_pybind.cu` — adjust includes
- `aiter/ops/custom_all_reduce.py` — Python op stubs with raw pointer types
- `aiter/dist/device_communicators/custom_all_reduce.py` — `IPCBuffer`, `IPCBufferPool`, TCPStore exchange, increased `max_size`
- `op_tests/multigpu_tests/test_car_rccl_latency.py` — latency comparison test

Test Plan
- Run `test_custom_allreduce.py` on 8× MI355 GPUs with both eager and graph modes to verify correctness (fp16, bf16).
- Run `test_car_rccl_latency.py` on 8× MI355 GPUs to compare latency against RCCL allreduce.

Test Result
Allreduce correctness tests pass on 8× MI355. Latency comparison with RCCL:
AITER custom allreduce matches or outperforms RCCL across all tested sizes on MI355.
Submission Checklist