
[AMD][Kimi K2.5 Day 0] ROCm: route W4A16 MoE to Triton and fix packed-weight loading #17863

Merged
HaiShaw merged 3 commits into sgl-project:main from jhinpan:k2.5-support on Jan 28, 2026

Conversation

@jhinpan
Collaborator

@jhinpan jhinpan commented Jan 28, 2026

Motivation

As reported in issue #17854, CompressedTensorsWNA16MoEMethod currently routes to Marlin kernels by default on ROCm. Marlin is NVIDIA-only, which breaks Kimi-K2.5 (native INT4) on MI300X. This patch dispatches to Triton on ROCm and fixes the weight-loading transpose path to avoid shape mismatches.
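For illustration, a minimal sketch of the routing idea (not the literal diff): select_wna16_moe_method is a hypothetical helper name and the constructor argument is shown schematically; in the PR the dispatch sits inside compressed_tensors_moe.py itself.

```python
# Minimal sketch of the dispatch idea described above, not the literal patch.
# select_wna16_moe_method is a hypothetical name used only for illustration.
from sglang.srt.layers.quantization.compressed_tensors.compressed_tensors_moe import (
    CompressedTensorsWNA16MoEMethod,
    CompressedTensorsWNA16TritonMoEMethod,
)
from sglang.srt.utils import is_hip  # ROCm/HIP detection helper referenced by the PR


def select_wna16_moe_method(quant_config):
    if is_hip():
        # Marlin kernels are CUDA-only, so ROCm (e.g. MI300X) must take the
        # Triton-backed method added by this PR.
        return CompressedTensorsWNA16TritonMoEMethod(quant_config)
    # NVIDIA GPUs keep the existing Marlin-backed default.
    return CompressedTensorsWNA16MoEMethod(quant_config)
```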

Modifications

  • python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors_moe.py
    • Add CompressedTensorsWNA16TritonMoEMethod and convert weights/scales to Triton layout.
    • Dispatch to Triton when is_hip() is true.
  • python/sglang/srt/layers/moe/fused_moe_triton/layer.py
    • Include CompressedTensorsWNA16TritonMoEMethod in packed-weight transpose checks (a minimal sketch of this check follows the list).
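
Below is an illustrative sketch of the transpose check, under the assumption that packed-weight methods are recognized by class name; PACKED_WNA16_MOE_METHODS and _should_transpose_packed_weights are hypothetical names, not the actual symbols in fused_moe_triton/layer.py.

```python
# Illustrative sketch only; names below are hypothetical placeholders.
PACKED_WNA16_MOE_METHODS = (
    "CompressedTensorsWNA16MoEMethod",
    "CompressedTensorsWNA16TritonMoEMethod",  # added so the ROCm/Triton path matches too
)


def _should_transpose_packed_weights(quant_method) -> bool:
    # Packed WNA16 MoE weights (e.g. w13_weight_packed) need a transpose at
    # load time; without recognizing the new Triton method here, the ROCm path
    # hits the shape mismatches described in the Motivation.
    return type(quant_method).__name__ in PACKED_WNA16_MOE_METHODS
```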

Accuracy Tests

Tested on GSM8K:

[Screenshot: GSM8K accuracy results]

Benchmarking and Profiling

Benchmark results here: notion.so/Kimi-K2-5-on-MI300X-2f5651cb22e580cb9395d6169ee59d66?pvs=73

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@HaiShaw
Collaborator

HaiShaw commented Jan 28, 2026

/tag-and-rerun-ci

@HaiShaw HaiShaw added the amd label Jan 28, 2026

@HaiShaw HaiShaw left a comment


  1. This should work for MI350/355 as well.
  2. Can we add MoE Triton tuning later?

@HaiShaw HaiShaw merged commit 1953efb into sgl-project:main Jan 28, 2026
167 of 189 checks passed
@jhinpan
Collaborator Author

jhinpan commented Jan 28, 2026

  > 1. This should work for MI350/355 as well.
  > 2. Can we add MoE Triton tuning later?

  1. Yes, I will test it this weekend when I get the MI350 back.
  2. Sure. Once we have the kernel optimization pipeline, we can give it a try as well.

if getattr(layer, "is_triton_converted", False):
    return

num_experts = layer.w13_weight_packed.shape[0]
Collaborator


I don't see the variable "num_experts" used anywhere in this function block; does it still need to exist?

Collaborator Author


Thanks for catching this nit! I will clean it up today.

@HaiShaw HaiShaw changed the title [AMD] ROCm: route W4A16 MoE to Triton and fix packed-weight loading [AMD][Kimi K2.5 Day 0] ROCm: route W4A16 MoE to Triton and fix packed-weight loading Jan 29, 2026
@jhinpan jhinpan deleted the k2.5-support branch January 29, 2026 20:59
