
[Fix] fix allreduce bug in Piecewise Graph#12106

Merged
ispobock merged 3 commits into sgl-project:main from zyksir:yikai/fix-piece-allreduce-bug
Oct 26, 2025

Conversation

@zyksir
Collaborator

@zyksir zyksir commented Oct 25, 2025

Motivation

Previously, when we enabled piecewise-cuda-graph:

  • we might get an illegal memory access for tp>1
  • the code would not compile if custom allreduce was disabled

This is the command I used to test:
# launch server
python3 -m sglang.launch_server --model-path Qwen/Qwen3-Coder-30B-A3B-Instruct --tp 2 --host 0.0.0.0 --enable-piecewise-cuda-graph --piecewise-cuda-graph-compiler eager

# send request
python3 -m sglang.bench_serving --backend sglang-oai  --dataset-name random --random-input-len 4096 --random-output-len 20 --random-range-ratio 1 --num-prompts 10 --max-concurrency 1 --warmup-requests 3 

Related PR #11845 #10062

Modifications

There are two bugs here.
The first one is that when we disable custom allreduce, we get an error while capturing the piecewise cuda graph.
I think this is because torch.compile cannot include nccl in the graph. Therefore, I use sglang.inplace_all_reduce as the point where the graph is split.

  • Note that sglang.inplace_all_reduce stands for NCCL, while sglang.outplace_all_reduce stands for others like custom allreduce.
  • We cannot use sglang.outplace_all_reduce to split the graph, since sglang.outplace_all_reduce creates a new tensor every time, while cuda graph needs the input tensors to stay at fixed addresses (see the sketch after this list).
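The distinction matters because CUDA graph replay reuses the exact device pointers recorded at capture time. Below is a minimal sketch in plain torch (not sglang's actual allreduce code) illustrating why an op that writes into a pre-allocated buffer can live inside a captured graph, while an op that allocates a fresh output each call cannot.

```python
# Minimal sketch (plain torch, not sglang's code): CUDA graph replay reuses
# the device addresses recorded at capture time, so only ops that write into
# pre-allocated buffers behave correctly across replays.
import torch

static_in = torch.zeros(8, device="cuda")
static_out = torch.zeros(8, device="cuda")

def inplace_style(x: torch.Tensor, out: torch.Tensor) -> torch.Tensor:
    out.copy_(x * 2)  # writes into the captured buffer; its address never changes
    return out

def outplace_style(x: torch.Tensor) -> torch.Tensor:
    return x * 2      # fresh allocation each call; the address recorded during
                      # capture is invisible to later callers, breaking replay

# Warm up on a side stream before capture, as recommended for CUDA graphs.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    inplace_style(static_in, static_out)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    inplace_style(static_in, static_out)

static_in.fill_(3.0)
g.replay()             # replays against the recorded addresses
print(static_out)      # tensor([6., 6., ...]) -- the updated input is visible
```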

The second one is that we get an illegal memory access for large message sizes.
I think this is because, for custom allreduce, the message size cannot be too large.

  • When I set the max msg size for custom allreduce back to the original value (8M), this bug is gone.
  • Actually, for large message sizes custom allreduce is worse than nccl anyway, since its throughput is lower (see the sketch after this list).
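As a rough illustration of the cap described above, a size-based fallback might look like the following sketch. The names here (custom_all_reduce, CUSTOM_AR_MAX_BYTES) are hypothetical and do not correspond to sglang's actual API.

```python
# Hypothetical sketch of a size-based fallback: use the custom kernel for small
# messages and NCCL above the 8 MB cap mentioned above. Names are illustrative.
import torch
import torch.distributed as dist

CUSTOM_AR_MAX_BYTES = 8 * 1024 * 1024  # 8 MB

def custom_all_reduce(t: torch.Tensor) -> torch.Tensor:
    # Stand-in for the low-latency custom kernel; it delegates to NCCL here
    # only so that this sketch stays runnable.
    dist.all_reduce(t)
    return t

def all_reduce_with_fallback(t: torch.Tensor) -> torch.Tensor:
    msg_bytes = t.numel() * t.element_size()
    if msg_bytes <= CUSTOM_AR_MAX_BYTES:
        return custom_all_reduce(t)  # small message: latency-optimized path
    dist.all_reduce(t)               # large message: NCCL has better throughput
    return t
```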

Accuracy Tests

Benchmarking and Profiling

Checklist


@zyksir
Collaborator Author

zyksir commented Oct 25, 2025

For now, I disable custom allreduce in the piecewise graph. Otherwise I run into the issue shown below. I will leave this as a feature TODO, since in prefill the message size is large most of the time and the latency of custom allreduce and nccl allreduce is close.
[image: screenshot of the issue]

@ispobock
Collaborator

When I run python3 benchmark/gsm8k/bench_sglang.py --parallel 1319 --num-questions 1319 twice, it gets another illegal memory access error.

    submod_0 = self.submod_0(l_input_ids_, s72, l_self_modules_embed_tokens_parameters_weight_, l_self_modules_layers_modules_0_layer_communicator_input_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_q_norm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_k_norm_parameters_weight_, l_forward_batch_token_to_kv_pool_k_buffer_0_, l_forward_batch_token_to_kv_pool_v_buffer_0_, l_positions_, s80, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_, l_forward_batch_out_cache_loc, s67);  l_input_ids_ = l_self_modules_embed_tokens_parameters_weight_ = l_self_modules_layers_modules_0_layer_communicator_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_q_norm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_k_norm_parameters_weight_ = l_forward_batch_token_to_kv_pool_k_buffer_0_ = l_forward_batch_token_to_kv_pool_v_buffer_0_ = None
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/compilation/cuda_piecewise_backend.py", line 227, in __call__
    entry.cudagraph.replay()
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/graphs.py", line 117, in replay
    super().replay()
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

@zyksir
Collaborator Author

zyksir commented Oct 26, 2025

@ispobock The newest commit should fix this problem. I am not sure about the graph generated by torch.compile; it seems quite different from the cuda graph.

@ispobock ispobock merged commit 96a5a94 into sgl-project:main Oct 26, 2025
57 of 70 checks passed
@zyksir zyksir deleted the yikai/fix-piece-allreduce-bug branch February 27, 2026 17:11
