
[WIP][Feature] support tp-sp on qwen2/3 & deepseek v2/3/3.2#12820

Open
randgun wants to merge 1 commit intosgl-project:mainfrom
randgun:new_sp

Conversation

Contributor

@randgun randgun commented Nov 7, 2025

Motivation

For the classic dense decoder layer structure (self-attention + MLP), in the pure TP case the tensors are parallelized inside the attention and MLP layers, but every device still holds the full hidden states before and after each TP region. Assuming hidden states of shape [B, S, H] in 2-byte precision, the two layernorms therefore store 2*2BSH bytes of activations per device, i.e. 4*BSH bytes that are redundant across devices. SP splits this redundant data across multiple devices along the sequence dimension and performs layernorm independently on each shard. For more details, please refer to the research paper https://arxiv.org/pdf/2205.05198.pdf
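The activation accounting above can be sketched with some illustrative arithmetic (the function name and the example shapes are ours, chosen to match the 36K benchmark below; 2 bytes/element assumes bf16 or fp16):

```python
def layernorm_activation_bytes(B, S, H, tp, enable_sp, bytes_per_elem=2):
    """Bytes of layernorm activations held per device in one dense layer."""
    per_norm = bytes_per_elem * B * S * H  # full copy on every rank under pure TP
    if enable_sp:
        per_norm //= tp  # sequence dimension sharded across tp ranks
    return 2 * per_norm  # two norms: pre-attention and pre-MLP

B, S, H, tp = 1, 36000, 7168, 16
print(layernorm_activation_bytes(B, S, H, tp, enable_sp=False))  # 4*B*S*H bytes
print(layernorm_activation_bytes(B, S, H, tp, enable_sp=True))   # 1/tp of that
```

Under pure TP every rank pays the full 4*BSH bytes; with SP each rank pays only its 1/tp slice.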

(Screenshot 2025-11-10 114901)

We add a server argument "--enable-sp", which enables sequence parallelism when set. For example:

python3 -m sglang.launch_server --model-path $MODEL_PATH \
        --tp-size 16 --dp-size 1 --enable-sp \
        --trust-remote-code --attention-backend ascend --device npu --host $HOST_IP --port $PORT \
        --quantization w8a8_int8 --mem-fraction-static 0.8 \
        --chunked-prefill-size 16000 --context-length 16000 --max-prefill-tokens 16000 --max-total-tokens 16000 \
        --disable-radix-cache --moe-a2a-backend deepep --deepep-mode auto

TP-SP has two benefits:

  1. Reduces TTFT on long-sequence datasets (36K) by 10% on deepseek v3 and 7% on deepseek v3.2.
  2. Reduces peak memory because the RMSNorm layers hold fewer activations.

Modifications

  1. For the RowParallel linear, we replace the all-reduce comm op with reduce-scatter on the TP group when SP is enabled.
  2. For the dense layers of the models (qwen2/3 & deepseek v2/3/3.2), we split the residual before layernorm at the first layer; the hidden states are already split at the RowParallel linear, so they do not need to be split again.
  3. After the first dense layer, all data stays in the scattered state; we only do layernorm in prepare_attn and prepare_mlp.
  4. Before the MLP and attention modules, we do an extra all-gather because the weights have been split along the tensor dimension, to make sure the inputs are complete.
  5. We do an all-gather at the last dense layer.
  6. For the sparse layers of the deepseek models, when DeepEP is enabled the hidden states are already scattered; we take advantage of this and move _scattered_to_tp_attn_full from prepare_attention to after obtaining the q_lora and latent cache, which further reduces computation.
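The collective swaps described above can be illustrated with a single-process simulation (numpy stands in for the HCCL/NCCL collectives; each list element plays the role of one TP rank's tensor, the partial output of a RowParallel linear). This is a sketch of the communication pattern only, not the sglang implementation:

```python
import numpy as np

def all_reduce(partials):
    # pure TP: every rank ends up with the full summed tensor
    total = np.sum(partials, axis=0)
    return [total.copy() for _ in partials]

def reduce_scatter(partials):
    # TP-SP: every rank keeps only its 1/tp slice of the summed tensor,
    # so layernorm runs on a sequence shard instead of the full sequence
    total = np.sum(partials, axis=0)
    return list(np.split(total, len(partials), axis=0))

def all_gather(shards):
    # before attention/MLP: reassemble the full sequence on every rank
    full = np.concatenate(shards, axis=0)
    return [full.copy() for _ in shards]

tp, seq, hidden = 4, 8, 16
rng = np.random.default_rng(0)
partials = [rng.standard_normal((seq, hidden)) for _ in range(tp)]

tp_out = all_reduce(partials)      # each rank holds the full (8, 16) tensor
sp_out = reduce_scatter(partials)  # each rank holds a (2, 16) shard
regathered = all_gather(sp_out)    # back to (8, 16) before the next matmul

assert np.allclose(regathered[0], tp_out[0])
```

Reduce-scatter plus a later all-gather moves the same data as an all-reduce, but the window between them is where SP saves activation memory and redundant layernorm work.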

NOTE:

  1. We only adapt qwen2/3 and deepseek v2/3/3.2 on the Ascend backend; other backends can support SP with a small amount of additional code.
  2. TP-SP can be enabled together with CP ([Ascend] Deepseek v3 and v3.2 support Context Parallelism #12207) on the Ascend backend. We tested it on the dsv3.2 model with CP=16, TP=2, which further reduces TTFT by 1 percent and decreases peak runtime memory by 2 GB per device.

Accuracy Tests

The test_gsm8k.py is based on benchmark/gsm8k/bench_sglang.py; you can run it with
python3 benchmark/gsm8k/bench_sglang.py --host http://127.0.0.1 --port 8000 --num-questions 300
(accuracy results screenshot)

Benchmarking and Profiling

export cann_path=/usr/local/Ascend/ascend-toolkit/latest
source /usr/local/Ascend/driver/bin/setenv.bash
source ${cann_path}/../set_env.sh
source ${cann_path}/../../nnal/atb/set_env.sh
source ${cann_path}/opp/vendors/customize/bin/set_env.bash
export ASCEND_HOME_PATH=${cann_path}

# CPU high performance
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export SGLANG_SET_CPU_AFFINITY=1

# Memory Fragmentation
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32

# HCCL
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=16
export HCCL_BUFFSIZE=1600
#export HCCL_RDMA_PCIE_DIRECT_POST_NOSTRICT=TRUE
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_ALGO="level0:NA;level1:ring"

# Your NIC
export HCCL_SOCKET_IFNAME=enp48s3u1u1
export GLOO_SOCKET_IFNAME=enp48s3u1u1

export PYTHONPATH=$PWD/python/:$PYTHONPATH

python -m sglang.launch_server --model-path $MODEL_PATH \
        --tp-size 16 --dp-size 1 --enable-sp \
        --trust-remote-code --attention-backend ascend --device npu --host 127.0.0.1 --port 8000 \
        --quantization w8a8_int8 --mem-fraction-static 0.79 \
        --chunked-prefill-size 36000 --context-length 36000 --max-prefill-tokens 36000 --max-total-tokens 36000 \
        --disable-radix-cache --moe-a2a-backend deepep --deepep-mode auto
(benchmark results screenshot)

Checklist


@randgun randgun marked this pull request as draft November 7, 2025 07:31
@randgun randgun changed the title support tp-sp on qwen2/3 & deepseek v2/3 [feat] support tp-sp on qwen2/3 & deepseek v2/3 Nov 10, 2025
@randgun randgun changed the title [feat] support tp-sp on qwen2/3 & deepseek v2/3 [feat] support tp-sp on qwen2/3 & deepseek v2/3/3.2 Nov 10, 2025
@randgun randgun changed the title [feat] support tp-sp on qwen2/3 & deepseek v2/3/3.2 [Feature] support tp-sp on qwen2/3 & deepseek v2/3/3.2 Nov 10, 2025
@randgun randgun changed the title [Feature] support tp-sp on qwen2/3 & deepseek v2/3/3.2 [WIP][Feature] support tp-sp on qwen2/3 & deepseek v2/3/3.2 Nov 10, 2025
@randgun randgun marked this pull request as ready for review November 10, 2025 07:46
@randgun randgun requested a review from yizhang2077 as a code owner November 15, 2025 14:26
output = torch.empty(
    dim_size, dtype=output_parallel.dtype, device=output_parallel.device
)
self.tp_group.reduce_scatter_tensor(output, output_parallel.contiguous())
Collaborator

@iforgetmyname iforgetmyname Nov 16, 2025


Just wondering why we have to do reduce_scatter here in the linear layer; the communication could be handled in the layer communicator.

Contributor Author


If we do not use reduce_scatter in linear, we have to set "skip_all_reduce=enable_sp" and do reduce_scatter at both the attention o_proj and the MLP RowParallel linear, and that code is unavoidable for any other model that wants to adapt SP. This parameter is similar to "skip_all_reduce"; maybe renaming it to "use_reduce_scatter" would be better?

Comment thread python/sglang/srt/layers/communicator.py Outdated
Comment thread python/sglang/srt/layers/linear.py Outdated
Contributor

@merrymercy merrymercy left a comment


  1. need a review from @ch-wan
  2. Think about how to avoid inserting communication primitives into model forward code and make them more reusable for more models
  3. Add a GPU test case

mamba_full_memory_ratio: float = 0.9

# Sequence parallelism
enable_sp: bool = False
Contributor


move to under # Runtime options

Collaborator

@ch-wan ch-wan left a comment


I had a quick review. Overall, I feel that the code quality can be much improved. Many changes can be simplified.

if mlp_mode == ScatterMode.SCATTERED:
    return ScatterMode.SCATTERED
if mlp_mode == ScatterMode.FULL:
    return ScatterMode.TP_ATTN_FULL
raise NotImplementedError

@classmethod
def _compute_layer_output_mode(cls, context: _LayerModeComputationContext):
mlp_mode = cls._compute_mlp_mode(context)
def _compute_layer_output_mode(
Collaborator


Why do we need to add mlp_mode to all mode propagation? Passing context is enough. Probably mlp_mode can be a @property function of ctx.

@@ -89,13 +90,13 @@ def __init__(
)
self.act_fn = SiluAndMul()

def forward(self, x):
if get_global_server_args().rl_on_policy_target is not None:
def forward(self, x, enable_sp: bool = False):
Collaborator


We can define a util function is_sp_layernorm_enabled to avoid passing this arg to multiple functions. Or we can check get_global_server_args().enable_sp in linear.py. You can refer to how we implemented is_dp_attention_enabled.
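A minimal sketch of the pattern suggested here (all names are hypothetical, modeled on the is_dp_attention_enabled pattern rather than copied from sglang; _server_args stands in for sglang's global server args object):

```python
from types import SimpleNamespace

_server_args = None  # stand-in for sglang's global server args

def set_global_server_args(args):
    global _server_args
    _server_args = args

def is_sp_layernorm_enabled() -> bool:
    # Read the flag once from global config instead of threading an
    # enable_sp argument through every forward() signature.
    return bool(getattr(_server_args, "enable_sp", False))

set_global_server_args(SimpleNamespace(enable_sp=True))
print(is_sp_layernorm_enabled())
```

Callers such as linear.py or the layer communicator would then query the util instead of taking enable_sp as a parameter.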

@@ -3164,6 +2979,13 @@ def add_cli_args(parser: argparse.ArgumentParser):
help="The ratio of mamba state memory to full kv cache memory.",
)

# Sequence parallelism
parser.add_argument(
"--enable-sp",
Collaborator


I recommend renaming it to --enable-sp-layernorm for clarity.

else:
forward_batch.prepare_attn_tp_scatter_input(self)
forward_batch.prepare_mlp_sync_batch(
self, get_global_server_args().enable_sp
Collaborator


We can get this server arg internally.

Contributor

@merrymercy merrymercy left a comment


Let us hold this until we fix all code quality issues
