Describe the bug
I've observed a significant performance regression when running DeepSeek V3.2 compared to V3.1 in prefill-decode disaggregation mode. Under identical conditions, V3.2 prefill throughput is roughly one third of V3.1's.
DeepSeek V3.2: (image not preserved)
DeepSeek V3.1: (image not preserved)
Reproduction
Environment:
- SGLang version: main branch (latest; the environment dump reports 0.5.3.post1)
- Setup: Prefill-decode disaggregation (1 prefill node + 2 decode nodes)
- Test target: Prefill node only
- Hardware: 8x NVIDIA H200 (TP=8, EP=8)
Test Configuration:
- Dataset: Average length 3.5k tokens, max length 15k tokens
- Launch command: Identical for both models (see below), only model weights differ
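For anyone reproducing the comparison, here is a minimal sketch of how the two runs can be put on the same footing: prefill throughput as prompt tokens processed divided by wall-clock time, and the V3.2/V3.1 ratio. The `PrefillRun` class and `throughput_ratio` helper are hypothetical names for illustration and are not part of SGLang; the numbers in the usage section are placeholders, not my measurements.

```python
from dataclasses import dataclass


@dataclass
class PrefillRun:
    """One benchmark run against the prefill node (values supplied by the caller)."""

    model: str
    input_tokens: int       # total prompt tokens processed during the run
    elapsed_seconds: float  # wall-clock duration of the run

    @property
    def tokens_per_second(self) -> float:
        if self.elapsed_seconds <= 0:
            raise ValueError("elapsed_seconds must be positive")
        return self.input_tokens / self.elapsed_seconds


def throughput_ratio(candidate: PrefillRun, baseline: PrefillRun) -> float:
    """Return candidate throughput as a fraction of baseline throughput."""
    return candidate.tokens_per_second / baseline.tokens_per_second


if __name__ == "__main__":
    # Placeholder numbers for illustration only; substitute real measurements.
    v31 = PrefillRun("DeepSeek-V3.1", input_tokens=3_500_000, elapsed_seconds=100.0)
    v32 = PrefillRun("DeepSeek-V3.2", input_tokens=3_500_000, elapsed_seconds=300.0)
    print(f"V3.2 / V3.1 prefill throughput: {throughput_ratio(v32, v31):.2f}")
```

Feeding both runs the same dataset keeps `input_tokens` equal, so the ratio reduces to the inverse ratio of elapsed times.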
Launch Command
python -m sglang.launch_server \
--model-path /mnt/beegfs/models/DeepSeek-V3.2 \
--served-model-name DeepSeek-V3.2 \
--disaggregation-mode prefill \
--host 192.168.100.1 \
--port 30000 \
--tp-size 8 \
--ep-size 8 \
--trust-remote-code \
--chunked-prefill-size 16384 \
--moe-dense-tp-size 1 \
--enable-eplb \
--ep-dispatch-algorithm dynamic \
--eplb-algorithm deepseek \
--mem-fraction-static 0.75 \
--ep-num-redundant-experts 16 \
--moe-a2a-backend deepep \
--deepep-mode normal \
--eplb-rebalance-num-iterations 300 \
--enable-expert-distribution-metrics \
--watchdog-timeout 900 \
--speculative-algorithm EAGLE \
--speculative-num-steps 2 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--max-running-requests 2000 \
--speculative-attention-mode prefill \
--speculative-accept-threshold-single 0.95 \
--speculative-accept-threshold-acc 0.98 \
--enable-metrics \
--enable-cache-report \
--page-size 64 \
--enable-nccl-nvls \
--deepep-config /mnt/beegfs/sglang-workspace/deepep_prefix_8_tune.json
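To make the "only model weights differ" condition easy to verify, the command above can be generated from one shared flag list. This is a sketch, not something shipped with SGLang: `build_launch_command` is a hypothetical helper, the flags are copied verbatim from the command above, and the V3.1 path in the usage section is a placeholder rather than my real path.

```python
import shlex

# Flags copied from the launch command above; identical for both models.
SHARED_FLAGS = [
    "--disaggregation-mode", "prefill",
    "--host", "192.168.100.1",
    "--port", "30000",
    "--tp-size", "8",
    "--ep-size", "8",
    "--trust-remote-code",
    "--chunked-prefill-size", "16384",
    "--moe-dense-tp-size", "1",
    "--enable-eplb",
    "--ep-dispatch-algorithm", "dynamic",
    "--eplb-algorithm", "deepseek",
    "--mem-fraction-static", "0.75",
    "--ep-num-redundant-experts", "16",
    "--moe-a2a-backend", "deepep",
    "--deepep-mode", "normal",
    "--eplb-rebalance-num-iterations", "300",
    "--enable-expert-distribution-metrics",
    "--watchdog-timeout", "900",
    "--speculative-algorithm", "EAGLE",
    "--speculative-num-steps", "2",
    "--speculative-eagle-topk", "1",
    "--speculative-num-draft-tokens", "4",
    "--max-running-requests", "2000",
    "--speculative-attention-mode", "prefill",
    "--speculative-accept-threshold-single", "0.95",
    "--speculative-accept-threshold-acc", "0.98",
    "--enable-metrics",
    "--enable-cache-report",
    "--page-size", "64",
    "--enable-nccl-nvls",
    "--deepep-config", "/mnt/beegfs/sglang-workspace/deepep_prefix_8_tune.json",
]


def build_launch_command(model_path: str, served_model_name: str) -> list[str]:
    """Return the argv list for the prefill server; only the model args vary."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--served-model-name", served_model_name,
        *SHARED_FLAGS,
    ]


if __name__ == "__main__":
    v32 = build_launch_command("/mnt/beegfs/models/DeepSeek-V3.2", "DeepSeek-V3.2")
    # Placeholder path for the V3.1 weights (assumption, not from my setup).
    v31 = build_launch_command("/path/to/DeepSeek-V3.1", "DeepSeek-V3.1")
    print(shlex.join(v32))
    # Positions where the two argv lists differ: model path and served name only.
    print([i for i, (a, b) in enumerate(zip(v32, v31)) if a != b])
```

Diffing the two argument lists this way rules out an accidental flag mismatch as the cause of the throughput gap.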
Environment
Python: 3.12.11 (main, Jun 4 2025, 08:56:18) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 580.82.07
PyTorch: 2.8.0+cu129
sglang: 0.5.3.post1
sgl_kernel: 0.3.15
flashinfer_python: 0.4.0
triton: 3.4.0
transformers: 4.57.0
torchao: 0.9.0
numpy: 2.3.3
aiohttp: 3.12.15
fastapi: 0.116.1
hf_transfer: 0.1.9
huggingface_hub: 0.35.3
interegular: 0.3.3
modelscope: 1.29.2
orjson: 3.11.3
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.7
python-multipart: 0.0.20
pyzmq: 27.1.0
uvicorn: 0.35.0
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.25
openai: 1.99.1
tiktoken: 0.11.0
anthropic: 0.68.1
litellm: Module Not Found
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE NODE SYS SYS SYS SYS SYS SYS 0-23,96-119 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE PIX NODE SYS SYS SYS SYS SYS SYS 0-23,96-119 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE PIX SYS SYS SYS SYS SYS SYS 0-23,96-119 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 SYS SYS SYS NODE PIX SYS SYS SYS SYS 24-47,120-143 1 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS SYS PIX NODE NODE SYS 48-71,144-167 2 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS NODE PIX NODE SYS 48-71,144-167 2 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS SYS NODE NODE PIX SYS 48-71,144-167 2 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS SYS SYS SYS PIX 72-95,168-191 3 N/A
NIC0 PIX NODE NODE SYS SYS SYS SYS SYS X NODE NODE SYS SYS SYS SYS SYS SYS
NIC1 NODE PIX NODE SYS SYS SYS SYS SYS NODE X NODE SYS SYS SYS SYS SYS SYS
NIC2 NODE NODE PIX SYS SYS SYS SYS SYS NODE NODE X SYS SYS SYS SYS SYS SYS
NIC3 SYS SYS SYS NODE SYS SYS SYS SYS SYS SYS SYS X NODE SYS SYS SYS SYS
NIC4 SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS NODE X SYS SYS SYS SYS
NIC5 SYS SYS SYS SYS PIX NODE NODE SYS SYS SYS SYS SYS SYS X NODE NODE SYS
NIC6 SYS SYS SYS SYS NODE PIX NODE SYS SYS SYS SYS SYS SYS NODE X NODE SYS
NIC7 SYS SYS SYS SYS NODE NODE PIX SYS SYS SYS SYS SYS SYS NODE NODE X SYS
NIC8 SYS SYS SYS SYS SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
ulimit soft: 1048576