
[AMD] Support context parallel for DeepSeek-V3.2 on AMD GPUs and add its test to nightly CI#19975

Open
hubertlu-tw wants to merge 3 commits into sgl-project:main from hubertlu-tw:dsv32_cp

Conversation


@hubertlu-tw hubertlu-tw commented Mar 5, 2026

Motivation

This PR adds and validates context-parallel (CP) support for DeepSeek-V3.2 on AMD GPUs, focusing on the round-robin-split CP mode with the AITER NSA backends.
The goal is to improve long-context serving throughput and latency on AMD platforms while preserving model quality.

Modifications

  • Added AMD CP coverage for DeepSeek-V3.2 single-node test flow (round-robin-split mode).
  • Added/updated ROCm 7.2 workflow job wiring for DeepSeek-V3.2 CP accuracy suite.
  • Updated DeepSeek-V3.2 documentation to clarify AMD support scope:
    • AMD supports round-robin-split CP mode only.
    • The recommended NSA backend on AMD is aiter for both prefill and decode.
    • Added explicit AMD launch example.
  • Updated AMD perf test configuration for DeepSeek-V3.2 basic benchmark cases (batch/input/output settings).
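For readers unfamiliar with the round-robin-split CP mode, here is a minimal sketch of the idea; the function name and structure are illustrative only, not the actual SGLang implementation:

```python
# Illustrative sketch: how a round-robin split could assign prefill token
# positions to context-parallel (CP) ranks. Not the actual SGLang code.
def round_robin_split(num_tokens: int, cp_size: int) -> list[list[int]]:
    """Assign token position i to CP rank i % cp_size."""
    shards: list[list[int]] = [[] for _ in range(cp_size)]
    for i in range(num_tokens):
        shards[i % cp_size].append(i)
    return shards

# With 10 tokens and 4 CP ranks, each rank holds an interleaved slice of
# the sequence. For causal attention this balances work across ranks
# better than a contiguous split, since later positions attend to more
# context than earlier ones.
print(round_robin_split(10, 4))
```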

Accuracy Tests

GSM8K results

| Setup | Shots | Accuracy | Invalid | Latency (s) | Output throughput (tok/s) |
|---|---|---|---|---|---|
| OG (no CP) | 8 | 0.961 | 0.000 | 72.274 | 1825.415 |
| OG (no CP) | 20 | 0.955 | 0.000 | 72.671 | 1798.530 |
| Context-Parallel (round-robin-split) | 8 | 0.956 | 0.000 | 85.697 | 1528.505 |
| Context-Parallel (round-robin-split) | 20 | 0.958 | 0.000 | 57.345 | 2260.404 |

Accuracy remains stable (within expected variance) between OG and CP runs.

Benchmarking and Profiling

Serving benchmark comparison (requested metrics)

| Scenario | Setup | Total token throughput (tok/s) | Mean E2E latency (ms) | Mean TTFT (ms) | Mean TPOT (ms) |
|---|---|---|---|---|---|
| Concurrency=4, Prompts=32 | OG (no CP) | 7156.80 | 39228.87 | 20797.84 | 92.62 |
| Concurrency=4, Prompts=32 | Context-Parallel | 12594.94 (+75.98%) | 22286.10 (−43.19%) | 7108.21 (−65.82%) | 76.27 (−17.65%) |
| Concurrency=8, Prompts=64 | OG (no CP) | 7508.30 | 74788.74 | 37203.10 | 188.87 |
| Concurrency=8, Prompts=64 | Context-Parallel | 22516.16 (+199.88%) | 24863.21 (−66.75%) | 4396.70 (−88.18%) | 102.85 (−45.54%) |

Percentages are changes relative to the OG (no CP) baseline; negative values in the latency columns indicate reductions.

Notes

  • CP (round-robin-split) shows clear serving throughput gain and latency reduction for long-context requests in this setup.
  • Benchmarks were run using local model path /data2/deepseek-ai/DeepSeek-V3.2-Exp.
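The relative deltas quoted above can be sanity-checked with a simple percent-change calculation (an assumption about how the deltas were derived; negative results indicate a reduction versus the OG baseline):

```python
# Sanity-check the relative deltas in the serving benchmark table.
def pct_change(baseline: float, value: float) -> float:
    """Percent change of `value` relative to `baseline` (negative = reduction)."""
    return (value - baseline) / baseline * 100.0

# Concurrency=4, Prompts=32: throughput up ~76%, mean E2E latency down ~43%.
print(f"throughput:  {pct_change(7156.80, 12594.94):+.2f}%")
print(f"e2e latency: {pct_change(39228.87, 22286.10):+.2f}%")
```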
Run commands

OG baseline (no CP)

```shell
python3 -m sglang.launch_server \
  --model /data2/deepseek-ai/DeepSeek-V3.2-Exp \
  --tp 8 \
  --trust-remote-code \
  --nsa-prefill-backend aiter \
  --nsa-decode-backend aiter

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
python3 benchmark/gsm8k/bench_sglang.py --num-shots 20 --num-questions 1319 --parallel 1319

python3 -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --model /data2/deepseek-ai/DeepSeek-V3.2-Exp \
  --dataset-name random \
  --random-input 70000 \
  --random-output 200 \
  --random-range-ratio 1.0 \
  --max-concurrency 4 \
  --num-prompts 32

python3 -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --model /data2/deepseek-ai/DeepSeek-V3.2-Exp \
  --dataset-name random \
  --random-input 70000 \
  --random-output 200 \
  --random-range-ratio 1.0 \
  --max-concurrency 8 \
  --num-prompts 64
```

Context-Parallel (round-robin-split)

```shell
python3 -m sglang.launch_server \
  --model /data2/deepseek-ai/DeepSeek-V3.2-Exp \
  --tp 8 \
  --trust-remote-code \
  --nsa-prefill-backend aiter \
  --nsa-decode-backend aiter \
  --enable-nsa-prefill-context-parallel \
  --nsa-prefill-cp-mode round-robin-split \
  --attn-cp-size 8 \
  --chunked-prefill-size 16384

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
python3 benchmark/gsm8k/bench_sglang.py --num-shots 20 --num-questions 1319 --parallel 1319

python3 -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --model /data2/deepseek-ai/DeepSeek-V3.2-Exp \
  --dataset-name random \
  --random-input 70000 \
  --random-output 200 \
  --random-range-ratio 1.0 \
  --max-concurrency 4 \
  --num-prompts 32

python3 -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --model /data2/deepseek-ai/DeepSeek-V3.2-Exp \
  --dataset-name random \
  --random-input 70000 \
  --random-output 200 \
  --random-range-ratio 1.0 \
  --max-concurrency 8 \
  --num-prompts 64
```
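As a rough capacity note on the CP launch flags, assuming round-robin-split spreads each prefill chunk evenly across CP ranks (the exact scheduling is up to the SGLang implementation), the flags imply each rank handles about 2048 tokens per prefill chunk:

```python
# Back-of-the-envelope math for the CP launch flags above. Assumes
# round-robin-split divides each prefill chunk evenly across CP ranks.
chunked_prefill_size = 16384  # --chunked-prefill-size
attn_cp_size = 8              # --attn-cp-size
tokens_per_rank_per_chunk = chunked_prefill_size // attn_cp_size
print(tokens_per_rank_per_chunk)  # 2048
```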

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

CC: @HaiShaw @yctseng0211 @bingxche

@github-actions github-actions Bot added documentation Improvements or additions to documentation amd deepseek labels Mar 5, 2026

HaiShaw commented Mar 5, 2026

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label Mar 5, 2026