[AMD] Add DeepSeek-V4-Pro FP4 MI355X ATOM DP-attention benchmark #1626

Merged: functionstackx merged 11 commits into main from seungrokj/dsv4-fp4-mi355x-atom-dp on Jun 1, 2026

Conversation

@seungrokj seungrokj commented May 31, 2026

Collaborator

Summary

  • Add new benchmark config dsv4-fp4-mi355x-atom-dp for DeepSeek-V4-Pro with DP-attention on MI355X using ATOM
  • Add new script benchmarks/single_node/dsv4_fp4_mi355x_atom_dp.sh with --enable-dp-attention --gpu-memory-utilization 0.85
  • Image: rocm/atom-dev:nightly_202605301523 (ATOM upstream run 26690241645, 2026-05-30)
  • Concurrency range: 64–1024 for both ISL 1024 and 8192

Performance vs current InferenceX (dsv4-fp4-mi355x-atom, nightly_202605130853)

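| ISL | OSL | Conc | InferenceX (tok/s/GPU) | ATOM DP (tok/s/GPU) | Δ% |
|------|------|------|---------|---------|---------|
| 1024 | 1024 | 64 | 389.30 | 443.01 | +13.8% |
| 1024 | 1024 | 128 | 601.21 | 774.50 | +28.8% |
| 1024 | 1024 | 256 | 880.78 | 1322.72 | +50.2% |
| 1024 | 1024 | 512 | n/a | 2028.30 | n/a |
| 1024 | 1024 | 1024 | n/a | 2984.23 | n/a |
| 8192 | 1024 | 64 | 1162.87 | 1505.66 | +29.5% |
| 8192 | 1024 | 128 | 1469.89 | 2366.74 | +61.0% |
| 8192 | 1024 | 256 | 704.73 | 3404.86 | +383.1% |
| 8192 | 1024 | 512 | n/a | 4196.99 | n/a |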
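
The Δ% column can be reproduced from the two throughput columns as (ATOM DP / InferenceX − 1) × 100. A minimal sketch, assuming `awk` is available; `pct_delta` is a hypothetical helper name and is not part of this PR:

```shell
#!/usr/bin/env bash
# Percent change of a new throughput over a baseline, formatted like the Δ% column.
# pct_delta is an illustrative helper, not something the PR ships.
pct_delta() {
  local baseline="$1" new="$2"
  awk -v b="$baseline" -v n="$new" 'BEGIN { printf "%+.1f%%\n", (n / b - 1) * 100 }'
}

pct_delta 389.30 443.01    # ISL 1024, conc 64
pct_delta 704.73 3404.86   # ISL 8192, conc 256
```

Rows without an InferenceX value have no baseline, so no Δ% can be computed for them.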

Test plan

  • Verify dsv4_fp4_mi355x_atom_dp.sh starts atom server with --enable-dp-attention --gpu-memory-utilization 0.85
  • Confirm dsv4-fp4-mi355x-atom-dp config picks up the new script
  • Run benchmark at conc=64 and conc=256 to confirm throughput matches upstream numbers
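
The first test-plan item could be approximated with a static check like the one below. This is only a sketch: it confirms that the flags appear in the script text, not that the server actually starts with them, and `script_has_flags` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Return 0 only if every given flag string appears literally in the script file.
# script_has_flags is an illustrative helper, not part of the PR.
script_has_flags() {
  local script="$1"; shift
  local flag
  for flag in "$@"; do
    # -F: fixed string; -e: lets the pattern start with "--"
    grep -q -F -e "$flag" -- "$script" || { echo "missing: $flag" >&2; return 1; }
  done
}

script="benchmarks/single_node/dsv4_fp4_mi355x_atom_dp.sh"
if [[ -f "$script" ]]; then
  script_has_flags "$script" --enable-dp-attention "--gpu-memory-utilization 0.85" \
    && echo "flags present in $script"
fi
```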

🤖 Generated with Claude Code


Note

Low Risk
Benchmark and image-tag changes only; no application auth or production serving paths affected.

Overview
Extends the existing dsv4-fp4-mi355x-atom MI355X ATOM benchmark for DeepSeek-V4-Pro FP4 with data-parallel attention, using the same dsv4_fp4_mi355x_atom.sh launcher rather than a separate config key.

The container image moves from rocm/atom-dev:nightly_202605130853 to rocm/atom:rocm7.2.4_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.3. The search space is split into two bands per ISL: TP8, EP1, conc 1–64 without DP-attn, then dp-attn: true from conc 64–1024 (1k/1k) or 64–512 (8k/1k).
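
For illustration, the DP-attention band for each ISL could be enumerated as below. This sketch assumes the sweep doubles concurrency at each step, as the measured points in the table do (64, 128, 256, ...); the actual step rule lives in the config and is not quoted here. `conc_sweep` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# List concurrency levels from start to end, doubling each step.
# Doubling is an assumption taken from the measured points, not from the config.
conc_sweep() {
  local c="$1" end="$2" out=()
  if (( c < 1 )); then
    echo "start must be >= 1" >&2
    return 1
  fi
  while (( c <= end )); do
    out+=("$c")
    c=$(( c * 2 ))
  done
  echo "${out[*]}"
}

echo "1k/1k dp-attn: $(conc_sweep 64 1024)"
echo "8k/1k dp-attn: $(conc_sweep 64 512)"
```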

The benchmark script now builds PARALLEL_ARGS from matrix DP_ATTENTION and EP_SIZE (--enable-dp-attention, and expert parallel when EP>1), and starts the server with --gpu-memory-utilization 0.85. perf-changelog.yaml records the update.
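
A rough sketch of the PARALLEL_ARGS pattern described above, not the script's actual contents. It assumes `TP`, `EP_SIZE` and `DP_ATTENTION` arrive as environment variables from the matrix, and that tensor parallelism is passed as `-tp`. The expert-parallel flag name is a placeholder, since the PR text does not spell it out.

```shell
#!/usr/bin/env bash
# Illustrative defaults; in the benchmark these would come from the matrix entry.
TP="${TP:-8}"
EP_SIZE="${EP_SIZE:-1}"
DP_ATTENTION="${DP_ATTENTION:-false}"

build_parallel_args() {
  PARALLEL_ARGS=(-tp "$TP")
  if [[ "$DP_ATTENTION" == "true" ]]; then
    PARALLEL_ARGS+=(--enable-dp-attention)
  fi
  if (( EP_SIZE > 1 )); then
    # Placeholder flag name: an assumption, not confirmed by the PR text.
    PARALLEL_ARGS+=(--enable-expert-parallel)
  fi
}

build_parallel_args
# Print the arguments rather than launching anything; the server entry point is omitted.
echo "server args:" "${PARALLEL_ARGS[@]}" --gpu-memory-utilization 0.85
```

Building the flags as an array rather than a string keeps each argument intact when it is expanded with `"${PARALLEL_ARGS[@]}"`.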

Reviewed by Cursor Bugbot for commit bc53139. Bugbot is set up for automated code reviews on this repo.

Add new benchmark config for DeepSeek-V4-Pro with DP-attention
enabled on MI355X using ATOM. Uses image
rocm/atom-dev:nightly_202605301523 with --enable-dp-attention
and --gpu-memory-utilization 0.85. Concurrency range 64-1024.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

2 similar comments

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread .github/configs/amd-master.yaml Outdated
Comment thread perf-changelog.yaml Outdated
- Consolidate the DP-attention and non-DP search spaces under a single
  dsv4-fp4-mi355x-atom config key using the stable atom0.1.3 image
- Delete the standalone dsv4_fp4_mi355x_atom_dp.sh benchmark script
  (DP-attention now handled by the shared glm5 script pattern)
- Update glm5_fp8_mi355x_atom.sh to support DP_ATTENTION flag via
  PARALLEL_ARGS, enabling dp-attn and expert-parallel combinations
- Update perf-changelog.yaml config-key and image reference accordingly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@seungrokj seungrokj changed the title Add DeepSeek-V4-Pro FP4 MI355X ATOM DP-attention benchmark ]AMD] Add DeepSeek-V4-Pro FP4 MI355X ATOM DP-attention benchmark May 31, 2026
@seungrokj seungrokj changed the title ]AMD] Add DeepSeek-V4-Pro FP4 MI355X ATOM DP-attention benchmark [AMD] Add DeepSeek-V4-Pro FP4 MI355X ATOM DP-attention benchmark May 31, 2026
Comment thread .github/configs/amd-master.yaml Outdated
seungrokj and others added 3 commits May 31, 2026 12:56
…ript

- dsv4_fp4_mi355x_atom.sh: replace EP string construction with
  PARALLEL_ARGS array pattern supporting DP_ATTENTION + EP_SIZE combos
- glm5_fp8_mi355x_atom.sh: revert PARALLEL_ARGS back to simple -tp/$EP
  (glm5 does not use dp-attention)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread benchmarks/single_node/dsv4_fp4_mi355x_atom.sh

… for prefix caching

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit ff26684.

Comment thread benchmarks/single_node/dsv4_fp4_mi355x_atom.sh

seungrokj and others added 2 commits June 1, 2026 21:57

@seungrokj

Collaborator Author

@functionstackx can you approve this?

@functionstackx

Collaborator

/reuse-sweep-run

@functionstackx functionstackx merged commit 99008ef into main Jun 1, 2026
89 of 91 checks passed
@functionstackx functionstackx deleted the seungrokj/dsv4-fp4-mi355x-atom-dp branch June 1, 2026 19:49
Oseltamivir added a commit that referenced this pull request Jun 1, 2026
#26383 (the DSv4 MTP graph fix) is on sglang main, not the amd/deepseek_v4
branch the rocm/sgl-dev:*-DSv4 images are cut from, so switch the MTP entry
onto the mainline ROCm nightly lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260601
which carries it. Mainline omits deep_gemm; the recipe now detects that and
routes the DSv4 fp8 wo_a / topk paths to their torch fallbacks
(SGLANG_OPT_FP8_WO_A_GEMM=0, SGLANG_TOPK_TRANSFORM_512_TORCH=1,
SGLANG_ENABLE_JIT_DEEPGEMM=0). No-op on a deep_gemm-bearing image.
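
One way the detection described above might look, as a sketch only. The environment variable names and values come from this commit message; probing with `python3 -c "import deep_gemm"` is an assumption about how the recipe detects the module, and `apply_deepgemm_fallbacks` is a hypothetical function name.

```shell
#!/usr/bin/env bash
# Route the DSv4 paths to their torch fallbacks when deep_gemm is not importable.
# Optional arguments override the probe command, which makes the logic easy to exercise.
apply_deepgemm_fallbacks() {
  local probe=("$@")
  if (( ${#probe[@]} == 0 )); then
    # Assumed probe: the real recipe may detect deep_gemm differently.
    probe=(python3 -c "import deep_gemm")
  fi
  if "${probe[@]}" >/dev/null 2>&1; then
    return 0   # deep_gemm is present: leave the environment untouched (no-op)
  fi
  export SGLANG_OPT_FP8_WO_A_GEMM=0
  export SGLANG_TOPK_TRANSFORM_512_TORCH=1
  export SGLANG_ENABLE_JIT_DEEPGEMM=0
}

apply_deepgemm_fallbacks
echo "SGLANG_ENABLE_JIT_DEEPGEMM=${SGLANG_ENABLE_JIT_DEEPGEMM:-<unset>}"
```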

Resolve perf-changelog conflict: keep atom (#1626) and vllm-mtp (#1630) from
main, update the sglang-mtp entry for the mainline image.