Refactor allreduce for supporting prefill case #2453
TennyWang1223 wants to merge 26 commits into main
Conversation
Support aiter tensor. The C++ interface previously took raw pointers for input and output; it now takes aiter tensors as parameters. Class pointers and IPCHandle pointers remain unchanged.
MI300 test result
MI308 test result
It looks like medium-sized cases still need optimization on gfx942.
move "torch.tesnor -> pybind aiter_tesnor_t" to dtypes.py |
[Kernel][Perf] Make allreduce fusion kernels support arbitrary hidden_dim

Previously the fused allreduce+rmsnorm+quant kernels only supported N=512/1024/2048/4096 via compile-time template dispatch. This made models with other hidden_dim (e.g. GLM-5 N=6144, GPT-OSS N=2880) fall back to the slower non-fused path.

Changes:
- Convert HIDDEN_DIM/BLOCK_SIZE from template parameter to runtime parameter in 1stage/2stage/split fusion kernels
- Use __launch_bounds__(1024,1) with runtime thread count
- Fix block_reduce for non-power-of-2 warp counts (round up reduce_width for shfl_xor correctness)
- Pad 1stage launch threads to WARP_SIZE multiples with active guard
- Use dynamic shared memory for 2stage kernel
- Generalize step2 dispatch (local_device_load_rmsnorm) to support any N where n_packs >= 64, removing n_bytes%1024 alignment requirement
- Replace silent printf errors with throw for unsupported shapes
- Add AITER_AR_1STAGE env override for benchmarking
- Improve test_fused_ar_rms.py: add error column, --test flag, multi-shape support, markdown summary table

Now supports any N that satisfies: N % pack_size == 0 and N / pack_size <= 1024 (i.e. N <= 8192 for bf16).
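A minimal sketch of the shape constraint stated above: N must divide evenly into vectorized packs and fit one pack per thread in a 1024-thread block. The 16-byte pack width is an assumption inferred from the bf16 example (N <= 8192 with 2-byte elements); the kernel's actual pack size may be defined differently.

```python
import torch

PACK_BYTES = 16    # assumed vector width handled per thread
MAX_THREADS = 1024 # matches __launch_bounds__(1024, 1)

def fused_ar_rms_supported(hidden_dim: int, dtype: torch.dtype) -> bool:
    pack_size = PACK_BYTES // torch.tensor([], dtype=dtype).element_size()
    return hidden_dim % pack_size == 0 and hidden_dim // pack_size <= MAX_THREADS

# e.g. bf16: pack_size = 8, so hidden_dims like 2880 and 6144 pass, up to 8192.
```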
* fea(ar): refactor custom allreduce
* fea: support prefill
* add latency cmp with rccl
* fix: remove ck in new kernel
* fix: ruff check
* fix: test script format
* fix: ruff check
* fix: pa_metadata macro err
* fea(car): support aiter tensor
* fix: move pybind aiter tensor to dtypes.py
* add aiter_tensor_module
* update
* update
* update
* update
* update
* update
* fix: fused_ar_rms gpt n=2880 case
* [Kernel][Perf] Make allreduce fusion kernels support arbitrary hidden_dim (see the full commit message above)
* fix: add param support_prefill in ar
* fix: test_fused_ar_rms.py
* fix: test_fused_ar_rms.py
---------
Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: amd-ruitang3 <rui.tang2@amd.com>
Co-authored-by: amd-ruitang3 <145657428+amd-ruitang3@users.noreply.github.com>
…used AR+RMSNorm
- parallel_state.py: Remove hardcoded hidden_dim allowlist {512,1024,2048,4096}
for 1-stage kernel selection; keep 128KB byte threshold. AITER's C++ dispatch
already gates which dims are supported (ROCm/aiter#2453).
- benchmark_fused_ar_rms_amd.py: Add hidden_dim=2880 (GPT-OSS) to default
decode and prefill shapes.
- test_aiter_allreduce_fusion_amd.py: Add multi-hidden-dim correctness test
covering 2880/4096/5120/6144/7168/8192, and bit-exact residual accuracy
regression test for ROCm/aiter#2586.
- Add PR documentation with A/B test results (GSM8K +2.3pp, TPOT -3.7%).
Made-with: Cursor
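A hedged sketch of the selection change described in the first bullet above: drop the hidden_dim allowlist and keep only the 128KB message-size threshold, letting AITER's own C++ dispatch reject unsupported dims. The function and constant names here are illustrative, not the actual identifiers in parallel_state.py.

```python
import torch

MAX_1STAGE_BYTES = 128 * 1024  # 128KB byte threshold retained from the old logic

def use_one_stage_allreduce(inp: torch.Tensor) -> bool:
    # Before: required inp.shape[-1] in {512, 1024, 2048, 4096} plus the size check.
    # After: size check only; unsupported hidden_dims raise inside AITER (ROCm/aiter#2453).
    return inp.numel() * inp.element_size() <= MAX_1STAGE_BYTES
```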
Can we get this PR merged in? @TennyWang1223 cc @zufayu
Hi @TennyWang1223 — sgl-project/sglang#23580 reports a HIP graph capture invalidation involving the AITER custom allreduce. Could you confirm whether this PR's refactor addresses that case? Tracking issue with full context: #2941 (target v0.1.14). Without this PR (or an equivalent fix), AITER allreduce stays disabled in SGLang production via PR sgl-project/sglang#23581. Thanks!
This PR has already been merged into main. Due to a GitHub bug, it still shows as unmerged here. The AITER code in use when SGLang reported the bug should therefore already include the changes from this PR, so merging it again wouldn't resolve the issue. I'll manually close this PR later. As for the bug sgl-project/sglang#23580, I'll investigate the root cause now.
Hi @TennyWang1223 — a small follow-up to confirm intent. The PR shows as closed without merge in both the GitHub UI and the API, and the branch state doesn't make the outcome obvious. Two possibilities: the changes actually landed on main under a different commit (e.g. a squash merge) and the closed state is just a display issue, or the PR was intentionally closed without merging.
Either way is fine, I just want to make sure downstream consumers reading the closed PR get the right signal. Thanks!
The PR was actually merged via squash-merge as commit 8cfe5e281 ("Refactor allreduce for supporting prefill case (#2453)"), authored on 2026-04-01. The closed-without-merge state in the PR card appears to be a GitHub display bug. Verification (terminal output attached): $ git fetch origin
$ git log origin/main --oneline | grep "#2453"
$ git merge-base --is-ancestor 8cfe5e281 origin/main && echo "on main"
See also: 8cfe5e281 — the main branch tag is shown on the commit page.

Motivation
Refactor the custom allreduce implementation to decouple its C++ layer from PyTorch and its Python-side IPC exchange from RCCL/gloo, making the module more portable and self-contained. Additionally, increase the max buffer size to support prefill workloads with larger tensors.
Technical Details
1. IPC buffer management refactoring

Introduce `IPCBuffer` and `IPCBufferPool` classes to encapsulate the IPC buffer lifecycle. `IPCBuffer` abstracts over two allocation modes — uncached (`hipMalloc`) for synchronization metadata and cached (`torch.empty`) for D2D relay. `IPCBufferPool` manages named buffers and provides IPC handle exchange for both eager mode (pre-registered buffers) and graph mode (dynamically captured addresses).
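A minimal sketch of how the two classes could fit together. Method names and the `alloc_uncached` hook are simplified assumptions for illustration; the real implementation lives in `aiter/dist/device_communicators/custom_all_reduce.py`.

```python
import torch

class IPCBuffer:
    def __init__(self, nbytes: int, cached: bool, alloc_uncached=None):
        if cached:
            # cached mode: ordinary torch allocation, used for D2D relay data
            self.tensor = torch.empty(nbytes, dtype=torch.uint8, device="cuda")
            self.ptr = self.tensor.data_ptr()
        else:
            # uncached mode: synchronization metadata; the real code allocates this
            # with hipMalloc via the C++ extension (alloc_uncached stands in for it)
            self.ptr = alloc_uncached(nbytes)

class IPCBufferPool:
    def __init__(self):
        self.buffers: dict[str, IPCBuffer] = {}  # named buffers, looked up by role

    def register(self, name: str, buf: IPCBuffer):
        self.buffers[name] = buf

    def exchange_handles(self, store, rank: int, world_size: int):
        # export this rank's handles, then gather everyone else's (see item 3 below)
        ...
```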
2. Decouple C++ layer from `torch::Tensor`

All C++ interfaces in `custom_all_reduce.cu`, `.cuh`, and `.h` are changed from `torch::Tensor` parameters/return values to raw pointers (`int64_t`/`void*`), element counts, dtype codes, and an explicit `hipStream_t`. The C++ code now compiles without linking `libtorch`. The Python layer extracts primitives via `tensor.data_ptr()`, `tensor.numel()`, `tensor.dtype`, and `torch.cuda.current_stream().cuda_stream` before calling into C++. The `_is_weak_contiguous` check is also moved to the Python side.
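A sketch of the Python-side call path after the refactor: primitives are pulled out of the tensor before crossing into the torch-free C++ layer. The extension entry point name `all_reduce_raw` and the dtype encoding are placeholders, not the actual binding.

```python
import torch

def _is_weak_contiguous(t: torch.Tensor) -> bool:
    # contiguous, or a dense view whose storage tail exactly covers the data
    return t.is_contiguous() or (
        t.untyped_storage().nbytes() - t.storage_offset() * t.element_size()
        == t.numel() * t.element_size()
    )

def all_reduce(module, inp: torch.Tensor, out: torch.Tensor):
    assert _is_weak_contiguous(inp)  # check now lives on the Python side
    module.all_reduce_raw(           # placeholder name for the raw-pointer binding
        inp.data_ptr(), out.data_ptr(),
        inp.numel(), str(inp.dtype),
        torch.cuda.current_stream().cuda_stream,
    )
```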
3. Replace RCCL/gloo-based IPC handle broadcast with TCP store

`IPCBufferPool._gather_ipc_meta` now uses `torch.distributed.TCPStore` `set`/`get` (a pure-TCP key-value store) instead of `dist.broadcast_object_list` (which routes through the gloo collective backend). An assertion verifies the underlying store is a `TCPStore`, ensuring no collective communication backend is involved. `store.get()` blocks until the key is available, providing natural barrier semantics.
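A sketch of a TCPStore-based exchange as described above: each rank publishes its serialized handle under a rank-keyed name and then reads every peer's key; since `get()` blocks until the key exists, no extra barrier is needed. The key naming and hex serialization are illustrative assumptions.

```python
import torch.distributed as dist

def gather_ipc_meta(store, rank: int, world_size: int, my_handle: bytes) -> list:
    # the refactor asserts the store really is a TCPStore, not a collective backend
    assert isinstance(store, dist.TCPStore)
    # serialize the raw handle to a hex string before storing it
    store.set(f"aiter_ipc_handle_{rank}", my_handle.hex())
    # get() blocks until each peer has published its key, giving barrier semantics
    return [bytes.fromhex(store.get(f"aiter_ipc_handle_{r}").decode())
            for r in range(world_size)]
```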
4. Increase `max_size` to support prefill

`max_size` is raised from 128 MB to 1 GB to accommodate prefill-stage tensor sizes.

Files changed (8 files, +1042 / -691):
- `csrc/kernels/custom_all_reduce.cu` — full rewrite, torch-free implementation
- `csrc/include/custom_all_reduce.h` — raw pointer interfaces
- `csrc/include/custom_all_reduce.cuh` — remove transitive torch dependency
- `csrc/include/rocm_ops.hpp` — update pybind macro signatures
- `csrc/pybind/custom_all_reduce_pybind.cu` — adjust includes
- `aiter/ops/custom_all_reduce.py` — Python op stubs with raw pointer types
- `aiter/dist/device_communicators/custom_all_reduce.py` — `IPCBuffer`, `IPCBufferPool`, TCPStore exchange, increased `max_size`
- `op_tests/multigpu_tests/test_car_rccl_latency.py` — latency comparison test

Test Plan
- Run `test_custom_allreduce.py` on 8× MI355 GPUs with both eager and graph modes to verify correctness (fp16, bf16).
- Run `test_car_rccl_latency.py` on 8× MI355 GPUs to compare latency against RCCL allreduce.

Test Result
Allreduce correctness tests pass on 8× MI355. Latency comparison with RCCL:
AITER custom allreduce matches or outperforms RCCL across all tested sizes on MI355.
Submission Checklist