
[AMD] Support context parallel for DeepSeek-V3.2 on AMD GPUs and add its test to nightly CI#19975

Open
hubertlu-tw wants to merge 3 commits into sgl-project:main from hubertlu-tw:dsv32_cp

Conversation


@hubertlu-tw hubertlu-tw commented Mar 5, 2026

Motivation

This PR adds and validates context-parallel (CP) support for DeepSeek-V3.2 on AMD GPUs, focusing on the round-robin-split CP mode with the AITER NSA backends.
The goal is to improve long-context serving throughput and latency on AMD platforms while preserving model quality.

Modifications

  • Added AMD CP coverage for DeepSeek-V3.2 single-node test flow (round-robin-split mode).
  • Added/updated ROCm 7.2 workflow job wiring for DeepSeek-V3.2 CP accuracy suite.
  • Updated DeepSeek-V3.2 documentation to clarify AMD support scope:
    • AMD supports round-robin-split CP mode only.
    • The recommended NSA backend on AMD is aiter for both prefill and decode.
    • Added explicit AMD launch example.
  • Updated AMD perf test configuration for DeepSeek-V3.2 basic benchmark cases (batch/input/output settings).
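For readers unfamiliar with the round-robin-split CP mode, here is a minimal sketch of the idea; the function name and structure are illustrative only, not the actual SGLang implementation:

```python
# Illustrative sketch: how a round-robin split could assign prefill token
# positions to context-parallel (CP) ranks. Not the actual SGLang code.
def round_robin_split(num_tokens: int, cp_size: int) -> list[list[int]]:
    """Assign token position i to CP rank i % cp_size."""
    shards: list[list[int]] = [[] for _ in range(cp_size)]
    for i in range(num_tokens):
        shards[i % cp_size].append(i)
    return shards

# With 10 tokens and 4 CP ranks, each rank holds an interleaved slice of
# the sequence. For causal attention this balances work across ranks
# better than a contiguous split, since later positions attend to more
# context than earlier ones.
print(round_robin_split(10, 4))
```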

Accuracy Tests

GSM8K results

| Setup | Shots | Accuracy | Invalid | Latency (s) | Output throughput (tok/s) |
|---|---|---|---|---|---|
| OG (no CP) | 8 | 0.961 | 0.000 | 72.274 | 1825.415 |
| OG (no CP) | 20 | 0.955 | 0.000 | 72.671 | 1798.530 |
| Context-Parallel (round-robin-split) | 8 | 0.956 | 0.000 | 85.697 | 1528.505 |
| Context-Parallel (round-robin-split) | 20 | 0.958 | 0.000 | 57.345 | 2260.404 |

Accuracy remains stable (within expected variance) between OG and CP runs.

Benchmarking and Profiling

Serving benchmark comparison (requested metrics)

| Scenario | Setup | Total token throughput (tok/s) | Mean E2E latency (ms) | Mean TTFT (ms) | Mean TPOT (ms) |
|---|---|---|---|---|---|
| Concurrency=4, Prompts=32 | OG (no CP) | 7156.80 | 39228.87 | 20797.84 | 92.62 |
| Concurrency=4, Prompts=32 | Context-Parallel | 12594.94 (+75.98%) | 22286.10 (−43.19%) | 7108.21 (−65.82%) | 76.27 (−17.65%) |
| Concurrency=8, Prompts=64 | OG (no CP) | 7508.30 | 74788.74 | 37203.10 | 188.87 |
| Concurrency=8, Prompts=64 | Context-Parallel | 22516.16 (+199.88%) | 24863.21 (−66.75%) | 4396.70 (−88.18%) | 102.85 (−45.54%) |

Percentages are changes relative to the OG (no CP) baseline; negative values in the latency columns indicate reductions.

Notes

  • CP (round-robin-split) shows clear serving throughput gain and latency reduction for long-context requests in this setup.
  • Benchmarks were run using local model path /data2/deepseek-ai/DeepSeek-V3.2-Exp.
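The relative deltas quoted above can be sanity-checked with a simple percent-change calculation (an assumption about how the deltas were derived; negative results indicate a reduction versus the OG baseline):

```python
# Sanity-check the relative deltas in the serving benchmark table.
def pct_change(baseline: float, value: float) -> float:
    """Percent change of `value` relative to `baseline` (negative = reduction)."""
    return (value - baseline) / baseline * 100.0

# Concurrency=4, Prompts=32: throughput up ~76%, mean E2E latency down ~43%.
print(f"throughput:  {pct_change(7156.80, 12594.94):+.2f}%")
print(f"e2e latency: {pct_change(39228.87, 22286.10):+.2f}%")
```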
Run commands

OG baseline (no CP)

```shell
python3 -m sglang.launch_server \
  --model /data2/deepseek-ai/DeepSeek-V3.2-Exp \
  --tp 8 \
  --trust-remote-code \
  --nsa-prefill-backend aiter \
  --nsa-decode-backend aiter

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
python3 benchmark/gsm8k/bench_sglang.py --num-shots 20 --num-questions 1319 --parallel 1319

python3 -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --model /data2/deepseek-ai/DeepSeek-V3.2-Exp \
  --dataset-name random \
  --random-input 70000 \
  --random-output 200 \
  --random-range-ratio 1.0 \
  --max-concurrency 4 \
  --num-prompts 32

python3 -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --model /data2/deepseek-ai/DeepSeek-V3.2-Exp \
  --dataset-name random \
  --random-input 70000 \
  --random-output 200 \
  --random-range-ratio 1.0 \
  --max-concurrency 8 \
  --num-prompts 64
```

Context-Parallel (round-robin-split)

```shell
python3 -m sglang.launch_server \
  --model /data2/deepseek-ai/DeepSeek-V3.2-Exp \
  --tp 8 \
  --trust-remote-code \
  --nsa-prefill-backend aiter \
  --nsa-decode-backend aiter \
  --enable-nsa-prefill-context-parallel \
  --nsa-prefill-cp-mode round-robin-split \
  --attn-cp-size 8 \
  --chunked-prefill-size 16384

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
python3 benchmark/gsm8k/bench_sglang.py --num-shots 20 --num-questions 1319 --parallel 1319

python3 -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --model /data2/deepseek-ai/DeepSeek-V3.2-Exp \
  --dataset-name random \
  --random-input 70000 \
  --random-output 200 \
  --random-range-ratio 1.0 \
  --max-concurrency 4 \
  --num-prompts 32

python3 -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --model /data2/deepseek-ai/DeepSeek-V3.2-Exp \
  --dataset-name random \
  --random-input 70000 \
  --random-output 200 \
  --random-range-ratio 1.0 \
  --max-concurrency 8 \
  --num-prompts 64
```
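As a rough capacity note on the CP launch flags, assuming round-robin-split spreads each prefill chunk evenly across CP ranks (the exact scheduling is up to the SGLang implementation), the flags imply each rank handles about 2048 tokens per prefill chunk:

```python
# Back-of-the-envelope math for the CP launch flags above. Assumes
# round-robin-split divides each prefill chunk evenly across CP ranks.
chunked_prefill_size = 16384  # --chunked-prefill-size
attn_cp_size = 8              # --attn-cp-size
tokens_per_rank_per_chunk = chunked_prefill_size // attn_cp_size
print(tokens_per_rank_per_chunk)  # 2048
```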

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

CC: @HaiShaw @yctseng0211 @bingxche

@github-actions github-actions Bot added documentation Improvements or additions to documentation amd deepseek labels Mar 5, 2026

HaiShaw commented Mar 5, 2026

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label Mar 5, 2026