Merge upstream vLLM code into gfx11 by amd-callumm · Pull Request #983 · ROCm/vllm

amd-callumm · 2026-05-29T21:35:25Z

Purpose

Merge all commits from vllm-project/vllm:main since the last common ancestor with ROCm/vllm:gfx11.
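A minimal sketch of how the merge range could be inspected before merging. It assumes the remotes are named `upstream` (vllm-project/vllm) and `origin` (ROCm/vllm); the helper names are illustrative and not part of this PR. `git merge-base` and `git rev-list --count` are standard git commands.

```python
import subprocess


def merge_base_cmd(ours: str, theirs: str) -> list[str]:
    """Build the argv that prints the last common ancestor of two refs."""
    return ["git", "merge-base", ours, theirs]


def incoming_count_cmd(base: str, theirs: str) -> list[str]:
    """Build the argv that counts commits reachable from `theirs` but not `base`."""
    return ["git", "rev-list", "--count", f"{base}..{theirs}"]


def run(cmd: list[str]) -> str:
    """Run a command in the current repository and return its stripped stdout."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()
```

Inside a checkout, `run(merge_base_cmd("origin/gfx11", "upstream/main"))` would return the ancestor hash, which can then be passed to `incoming_count_cmd` to size the merge.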

The commit count and change volume are large, and the merge required a fair number of conflict resolutions. Substantial testing is therefore needed to ensure that critical functionality and optimizations are not lost.

Customer safeguard: in the event that some functionality breaks or performance regresses as a result of this merge, the gfx11_20260528 tag can be used to obtain stable pre-merge code. This package is based on the May 28 nightly build which showed good numbers and test results during nightly regressions.

Test Plan

Run representative subsets of all of the following test suites on a Strix Halo machine pre- and post-merge:
- tests/basic_correctness/test_basic_correctness.py
- tests/kernels/core/ (fundamental kernel operations)
- tests/kernels/moe/test_exllama_moe.py
- tests/kernels/moe/test_hybrid_w4a16_moe.py
- tests/kernels/quantization/test_awq_gemv_moe.py
- tests/kernels/quantization/test_hip_w4a16.py
- tests/kernels/quantization/test_hybrid_w4a16_triton.py
- tests/kernels/quantization/test_rocm_compressed_tensors_w4a16.py
- tests/kernels/quantization/test_rocm_skinny_gemms.py
- tests/kernels/quantization/test_dynamic_int8_lm_head.py
- tests/kernels/test_wvsplitk_fused_silu.py
- tests/kernels/attention/test_rocm_attention_selector.py
- tests/kernels/attention/test_triton_unified_attention.py
- tests/kernels/attention/test_cache.py (KV cache operations + reshaping)
- tests/kernels/attention/test_flash_attn.py
- tests/kernels/attention/test_paged_attn.py
- tests/kernels/moe/test_batched_moe.py
- tests/kernels/moe/test_fused_topk.py
- tests/quantization/test_hip_w4a16_kernel.py
- tests/quantization/test_fp8.py
- tests/quantization/test_compressed_tensors.py
- tests/samplers/test_beam_search.py
- tests/samplers/test_ignore_eos.py
- tests/models/language/generation/test_common.py
- tests/entrypoints/llm/test_generate.py
- tests/rocm/aiter/test_grouped_quant.py (grouped FP8 quantization)
- tests/rocm/aiter/test_mla_fp8_support_check.py (MLA FP8 support)
- tests/rocm/aiter/test_fused_qk_norm_mrope_kvcache.py

Run attention benchmarking tests to check for any regressions.

Run a sample of models/use cases from nightly gfx1151 benchmarks to check for severe performance regressions pre- and post-merge.

For any regressions found, plan next steps.
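One way the listed suites could be run identically on the pre- and post-merge checkouts is to build a single pytest command and write a JUnit XML report for each run. This is a sketch only: the helper name and report file name are hypothetical, and the PR does not state how the subsets were actually selected. The `-q`, `-k`, and `--junitxml` options are standard pytest flags.

```python
def pytest_cmd(test_paths, junit_xml, keyword=None):
    """Build a pytest argv that runs `test_paths` and writes a JUnit XML report.

    `keyword` is an optional `-k` expression for selecting a subset of cases.
    """
    cmd = ["python", "-m", "pytest", "-q", f"--junitxml={junit_xml}"]
    if keyword:
        cmd += ["-k", keyword]
    return cmd + list(test_paths)


# Two of the suites named in the test plan; the report file name is an assumption.
example = pytest_cmd(
    ["tests/kernels/attention/test_cache.py", "tests/quantization/test_fp8.py"],
    "pre_merge.xml",
)
```

Running the same argv on both checkouts, with only the report name changed, keeps the two result sets directly comparable.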

Test Result

For all correctness/functionality tests, exactly the same test cases are passing both pre- and post-merge. Out of ~1700 tests, 8 tests failed, all for pre-existing and low-risk reasons such as model gating.
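The claim that the same test cases pass before and after the merge amounts to a per-test comparison of outcomes. A minimal sketch, assuming each run has been reduced to a mapping from test ID to outcome string (the test IDs and outcomes below are made up for illustration):

```python
def compare_outcomes(pre, post):
    """Return {test_id: (pre_outcome, post_outcome)} for tests whose outcome changed.

    A test present in only one run is reported with the outcome "missing"
    on the other side.
    """
    changed = {}
    for test_id in sorted(set(pre) | set(post)):
        before = pre.get(test_id, "missing")
        after = post.get(test_id, "missing")
        if before != after:
            changed[test_id] = (before, after)
    return changed


# Hypothetical data: one pre-existing failure that is unchanged by the merge.
pre_run = {"test_a": "passed", "test_b": "failed"}
post_run = {"test_a": "passed", "test_b": "failed"}
unchanged = compare_outcomes(pre_run, post_run)
```

An empty result means the pass/fail sets are identical, which is the condition reported above.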

Attention benchmark tests show prefill regressions of up to 33% (SmolLM2-1.7B-Instruct-AWQ). The average regression for pure prefill cases is ~10.9%. Decode performance is ~3.3% slower on average (range: 1.7% speedup to 9.4% slowdown).
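For reference, figures like these can be derived from paired pre/post timings. The sketch below assumes a regression is measured as the relative increase in latency; the PR does not state the exact metric, and the sample timings are invented for illustration.

```python
from statistics import mean


def slowdown_pct(pre_ms, post_ms):
    """Relative latency change in percent; positive means post-merge is slower."""
    return (post_ms - pre_ms) * 100.0 / pre_ms


def summarize(pairs):
    """Summarize (pre_ms, post_ms) pairs as mean, min, and max slowdown."""
    changes = [slowdown_pct(pre, post) for pre, post in pairs]
    return {"mean": mean(changes), "min": min(changes), "max": max(changes)}


# Hypothetical timings: one slowdown and one small speedup.
summary = summarize([(100.0, 110.0), (200.0, 196.0)])
```

A negative minimum corresponds to a speedup, matching how the decode range above mixes a speedup with slowdowns.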

However, across 16 end-to-end tests, most showed little difference in overall TPOT/TTFT/end-to-end latency compared to pre-merge. A few short-context cases (~128 input tokens) showed a 1-3% TTFT slowdown; the short context suggests that the cause is something other than attention, e.g., GEMM or CPU overhead.
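As a reminder of what the two latency metrics measure, here is a sketch using a common convention: TTFT is the time from request start to the first output token, and TPOT is the remaining latency averaged over the remaining output tokens. The exact definitions used by the nightly benchmarks are not stated in this PR, so treat these formulas as an assumption.

```python
def ttft(request_start_s, first_token_s):
    """Time to first token, in seconds."""
    return first_token_s - request_start_s


def tpot(e2e_latency_s, ttft_s, output_tokens):
    """Time per output token after the first, in seconds."""
    if output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (e2e_latency_s - ttft_s) / (output_tokens - 1)
```

With short prompts, prefill attention is a small share of TTFT, which is consistent with the reasoning above that a short-context TTFT change points away from attention.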

All of these end-to-end benchmarks successfully completed post-merge.

mgehre-amd

Looks good, thanks!

Build is failing (OOM for skinny int4 GEMM?), please check

amd-callumm · 2026-06-02T19:47:22Z

> Build is failing (OOM for skinny int4 GEMM?), please check

Looking into this now.

amd-callumm · 2026-06-02T20:49:15Z

Reducing MAX_JOBS from 2 -> 1 in CI seemed to do the trick, at the cost of somewhat slower build times. I'll merge this if the rest of CI looks good.
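The trade-off described here, fewer parallel compile jobs in exchange for lower peak memory, can be sketched as a simple bound. The memory figures below are hypothetical; the PR does not report the runner's memory or the per-job peak for the skinny int4 GEMM translation unit.

```python
def max_parallel_jobs(mem_budget_gib, peak_per_job_gib, cpu_count):
    """Pick a MAX_JOBS value bounded by both available memory and CPU count.

    Always returns at least 1 so that the build can still proceed serially.
    """
    by_memory = int(mem_budget_gib // peak_per_job_gib)
    return max(1, min(cpu_count, by_memory))


# Hypothetical runner: 24 GiB usable, 16 GiB peak for the heaviest compile job.
jobs = max_parallel_jobs(mem_budget_gib=24, peak_per_job_gib=16, cpu_count=8)
```

Under these assumed numbers only one job fits in memory at a time, which mirrors the 2 -> 1 change that resolved the OOM.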

eble-amd · 2026-06-03T13:02:38Z

@amd-callumm The performance test job failed on runner 'linux-strix-halo-gpu-rocm-8-gpu0-1780422175'. I don't pay attention to every job, but every time I've checked, they have failed on runners 6 and 7, and passed on runners 8 and 9. This failure on 8 deserves some follow-up.

jikunshang and others added 30 commits May 19, 2026 11:17

[XPU] add gptq(int4) support (vllm-project#37844)

36dcaf2

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

[UX] Add a persistent cache for FlashInfer autotuning (vllm-project#4…

da03e54

…2537) Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>

[Bugfix][MRV2] Fix KVCache tensor explicit kernel_block_size dim (v…

fba010d

…llm-project#42766) Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com>

[Model Refactoring] Move DeepSeek V4 layers to models/deepseek_v4/ …

87b08c5

…[2/N] (vllm-project#43039) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>

add cutedsl dsv4 indexer fp8 kernel (vllm-project#42899)

3ca8db2

Signed-off-by: george <george@inferact.ai> Co-authored-by: george <george@inferact.ai>

[Bugfix][KV Connector] Fix SimpleCPUOffloadScheduler TOCTOU between P…

fab07e4

…hase A and Phase B (vllm-project#42289) Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: gemini-code-assist <noreply@google.com>

[ci] Route 28 gpu_1_queue tests to h200_35gb queue (vllm-project#43030)

6e889b5

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: use keyword arguments for shard_id and expert_id in weight_loade… (

27f4ba9

vllm-project#42671) Signed-off-by: junyanxu <junyanxu5513@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

[XPU] Use custom op collective behavior (vllm-project#41354)

f1e3f0e

Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>

[Misc] Aligning tokwise pooler heads for consistency (vllm-project#43041

4a4fdab

) Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>

[Frontend] Consolidate beam search by BeamSearchMixin. (vllm-project#…

301d986

…42946) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>

[Model Refactoring] Move deepseek_v4_ops to models/deepseek_v4 [3/N] (v…

b14be81

…llm-project#43073) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>

[bug] AsyncScheduler drops first post-resume token after pause_genera…

f34623b

…tion + clear_cache (vllm-project#42117) Signed-off-by: hao-aaron <ahao@anyscale.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

[KVConnector][DSV4] HMA support for Mooncake store connector (vllm-pr…

056bc2e

…oject#42828) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>

[Model Refactoring] Rename deepseek_v4.py to model.py [4/N] (vllm-pro…

07beaed

…ject#43077) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>

[Misc][MM] Remove redundant code in CLIPAttention (vllm-project#43046)

ef54a4d

Signed-off-by: shen-shanshan <467638484@qq.com>

[CI] Add MTP + PD disagg test for Qwen3.5 (vllm-project#42677)

129019f

Signed-off-by: ZhanqiuHu <zhu@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>

[Bugfix] Fix top logprobs token placeholders in `/inference/v1/genera…

a78b842

…te` (vllm-project#42887) Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

[Perf][4/n] Eliminate various GPU<->CPU syncs (vllm-project#42347)

b82e908

Signed-off-by: Nick Hill <nickhill123@gmail.com>

[XPU] update xpu graph usage (vllm-project#43043)

d740e2c

Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>

[Model] Openvla support (vllm-project#42654)

1c61580

Signed-off-by: Wang Yiwen <121547057+yiwen101@users.noreply.github.com>

[Refactor] Extract extract_types_from_schema utility from Minimax M2 …

42b4f1f

…tool parser (vllm-project#43025) Signed-off-by: sfeng33 <4florafeng@gmail.com>

[Misc] add humming to dependencies (vllm-project#42540)

8200fbe

Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>

[feat] Add FP8 per-tensor Q scale support to Triton attention backend (…

d247a93

…vllm-project#42080) Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

[Docs] Fix MooncakeStoreConnector role in disaggregated example (vllm…

aed2eb3

…-project#42994) Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude <noreply@anthropic.com>

[Bugfix][MoE] FlashInfer one-sided: workspace union across heterogene…

f54721b

…ous layers (vllm-project#42976) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>

[ci] Move language models tests (hybrid) back to L4 (vllm-project#43129)

a65093c

Signed-off-by: Kevin H. Luu <khluu000@gmail.com>

akii96 and others added 13 commits May 27, 2026 16:22

[ROCm][GPT-OSS] Avoid repeated compile-time cos_sin_cache.to(bf16) …

de12f5c

…casts in rotary path (vllm-project#42833) Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com>

[Doc] Add Ascend NPU tab to the quickstart installation guide (vllm-p…

ad464e1

…roject#43550) Signed-off-by: Aditya Singh <adisin650@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

[Rust Frontend] Align tool parser fallback behavior between streaming…

396c8fe

… & non-streaming paths (vllm-project#43662) Signed-off-by: Bugen Zhao <i@bugenzhao.com>

[Docs] Fix MLA prefill backend default docs (vllm-project#43697)

158289e

Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>

[Kernel] Enable TritonW4A16LinearKernel as CUDA fallback for non-Marl…

2272062

…in-aligned W4A16 shapes (vllm-project#43731) Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com>

[Bugfix] Map reasoning_effort to enable_thinking in chat template kwa…

52a31cc

…rgs (vllm-project#43401) Signed-off-by: Ashwin Giridharan <girida@amazon.com> Signed-off-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com>

[misc] Bump cutedsl version to 4.5.2 (vllm-project#43745)

03d9cc2

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>

[BugFix] HFValidationError with cloud storage URIs when HF_HUB_OFFLIN…

1654609

…E=1 (vllm-project#39155) Signed-off-by: Injae Ryou <injaeryou@gmail.com>

[Docs] Fix the duplicate doc icon issue (vllm-project#43546)

49a3510

Signed-off-by: chunyang.wen <chunyang.wen@gmail.com>

Fix early CUDA init (vllm-project#43791)

41688e2

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

[ROCm] mori: add InterNodeV1LL inter-node kernel selection via VLLM_M…

05c50c7

…ORI_INTERNODE_KERNEL (vllm-project#41751) Signed-off-by: jatseng-ai <jatseng@amd.com>

[Quantization] Fix Humming RoutedExperts import (vllm-project#43540)

206b72c

Signed-off-by: Minh Vu <vuhoangminh97@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>

amd-callumm force-pushed the callumm.upstream_merge branch from 51987e3 to 6e4fc54 Compare June 1, 2026 19:11

Merge remote-tracking branch 'upstream/main' into callumm.upstream_merge

112c8cb

Signed-off-by: Callum Mitchell <callumm@amd.com> Signed-off-by: <callumm@amd.com>

amd-callumm force-pushed the callumm.upstream_merge branch from 6e4fc54 to 112c8cb Compare June 2, 2026 16:20

amd-callumm marked this pull request as ready for review June 2, 2026 17:23

amd-callumm requested review from eble-amd, marcusr-amd, mgehre-amd, mkorhone and serged-amd June 2, 2026 17:23

mgehre-amd approved these changes Jun 2, 2026

[CI] build-rocm-wheels.yml: reduce MAX_JOBS to prevent OOM

adf8d91

Signed-off-by: <callumm@amd.com>

amd-callumm merged commit cdd11a6 into gfx11 Jun 2, 2026
4 of 5 checks passed

amd-callumm mentioned this pull request Jun 5, 2026

[Bugfix] Upstream merge #2 to Transformers v5 fix #986

Merged

parthash0804 mentioned this pull request Jun 10, 2026

Remove orphaned CustomMMDataset audio test (Superseded by upstream custom_audio) #993

Merged

amd-callumm merged 915 commits into gfx11 from callumm.upstream_merge


20 participants
