
[Fix] fix allreduce bug in Piecewise Graph#12106

Merged
ispobock merged 3 commits into sgl-project:main from zyksir:yikai/fix-piece-allreduce-bug
Oct 26, 2025

Conversation

@zyksir
Collaborator

@zyksir zyksir commented Oct 25, 2025

Motivation

Previously, when we enabled piecewise-cuda-graph:

  • we might get an illegal memory access for tp>1
  • the code would not compile if custom allreduce was disabled

This is the command I used to test:
# launch server
python3 -m sglang.launch_server --model-path Qwen/Qwen3-Coder-30B-A3B-Instruct --tp 2 --host 0.0.0.0 --enable-piecewise-cuda-graph --piecewise-cuda-graph-compiler eager

# send request
python3 -m sglang.bench_serving --backend sglang-oai  --dataset-name random --random-input-len 4096 --random-output-len 20 --random-range-ratio 1 --num-prompts 10 --max-concurrency 1 --warmup-requests 3 

Related PR #11845 #10062

Modifications

There are two bugs here.
The first one is that when we disable custom allreduce, we get an error while capturing the piecewise cuda graph.
I think this is because torch.compile cannot include nccl in the graph. Therefore, I use sglang.inplace_all_reduce as the point where the graph is split.

  • Note that sglang.inplace_all_reduce stands for NCCL, while sglang.outplace_all_reduce stands for others like custom allreduce.
  • We cannot use sglang.outplace_all_reduce to split the graph, since sglang.outplace_all_reduce creates a new tensor every time, while cuda graph needs the input tensors to stay at fixed addresses (see the sketch after this list).
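The distinction matters because CUDA graph replay reuses the exact device pointers recorded at capture time. Below is a minimal sketch in plain torch (not sglang's actual allreduce code) illustrating why an op that writes into a pre-allocated buffer can live inside a captured graph, while an op that allocates a fresh output each call cannot.

```python
# Minimal sketch (plain torch, not sglang's code): CUDA graph replay reuses
# the device addresses recorded at capture time, so only ops that write into
# pre-allocated buffers behave correctly across replays.
import torch

static_in = torch.zeros(8, device="cuda")
static_out = torch.zeros(8, device="cuda")

def inplace_style(x: torch.Tensor, out: torch.Tensor) -> torch.Tensor:
    out.copy_(x * 2)  # writes into the captured buffer; its address never changes
    return out

def outplace_style(x: torch.Tensor) -> torch.Tensor:
    return x * 2      # fresh allocation each call; the address recorded during
                      # capture is invisible to later callers, breaking replay

# Warm up on a side stream before capture, as recommended for CUDA graphs.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    inplace_style(static_in, static_out)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    inplace_style(static_in, static_out)

static_in.fill_(3.0)
g.replay()             # replays against the recorded addresses
print(static_out)      # tensor([6., 6., ...]) -- the updated input is visible
```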

The second one is that we get an illegal memory access for large message sizes.
I think this is because, for custom allreduce, the message size cannot be too large.

  • When I set the max msg size for custom allreduce back to the original value (8M), this bug is gone.
  • Actually, for large message sizes custom allreduce is worse than nccl anyway, since its throughput is lower (see the sketch after this list).
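As a rough illustration of the cap described above, a size-based fallback might look like the following sketch. The names here (custom_all_reduce, CUSTOM_AR_MAX_BYTES) are hypothetical and do not correspond to sglang's actual API.

```python
# Hypothetical sketch of a size-based fallback: use the custom kernel for small
# messages and NCCL above the 8 MB cap mentioned above. Names are illustrative.
import torch
import torch.distributed as dist

CUSTOM_AR_MAX_BYTES = 8 * 1024 * 1024  # 8 MB

def custom_all_reduce(t: torch.Tensor) -> torch.Tensor:
    # Stand-in for the low-latency custom kernel; it delegates to NCCL here
    # only so that this sketch stays runnable.
    dist.all_reduce(t)
    return t

def all_reduce_with_fallback(t: torch.Tensor) -> torch.Tensor:
    msg_bytes = t.numel() * t.element_size()
    if msg_bytes <= CUSTOM_AR_MAX_BYTES:
        return custom_all_reduce(t)  # small message: latency-optimized path
    dist.all_reduce(t)               # large message: NCCL has better throughput
    return t
```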

Accuracy Tests

Benchmarking and Profiling

Checklist


@zyksir
Collaborator Author

zyksir commented Oct 25, 2025

For now, I disable custom allreduce in the piecewise graph. Otherwise I run into the issue shown below. I will leave this as a feature TODO, since in prefill the message size is large most of the time and the latency of custom allreduce and nccl allreduce is close.
[image: screenshot of the issue]

@ispobock
Collaborator

When I run python3 benchmark/gsm8k/bench_sglang.py --parallel 1319 --num-questions 1319 twice, it gets another illegal memory access error.

    submod_0 = self.submod_0(l_input_ids_, s72, l_self_modules_embed_tokens_parameters_weight_, l_self_modules_layers_modules_0_layer_communicator_input_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_q_norm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_k_norm_parameters_weight_, l_forward_batch_token_to_kv_pool_k_buffer_0_, l_forward_batch_token_to_kv_pool_v_buffer_0_, l_positions_, s80, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_, l_forward_batch_out_cache_loc, s67);  l_input_ids_ = l_self_modules_embed_tokens_parameters_weight_ = l_self_modules_layers_modules_0_layer_communicator_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_q_norm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_k_norm_parameters_weight_ = l_forward_batch_token_to_kv_pool_k_buffer_0_ = l_forward_batch_token_to_kv_pool_v_buffer_0_ = None
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/compilation/cuda_piecewise_backend.py", line 227, in __call__
    entry.cudagraph.replay()
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/graphs.py", line 117, in replay
    super().replay()
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

@zyksir
Collaborator Author

zyksir commented Oct 26, 2025

@ispobock The newest commit should fix this problem. I am not sure about the graph generated by torch.compile; it seems quite different from the cuda graph.

@ispobock ispobock merged commit 96a5a94 into sgl-project:main Oct 26, 2025
57 of 70 checks passed
@zyksir zyksir deleted the yikai/fix-piece-allreduce-bug branch February 27, 2026 17:11
