Merge upstream vLLM code into gfx11 #983

Merged
amd-callumm merged 915 commits into gfx11 from callumm.upstream_merge
Jun 2, 2026
@amd-callumm amd-callumm commented May 29, 2026

Purpose

Merge all commits from vllm-project/vllm:main since the last common ancestor with ROCm/vllm:gfx11.

The commit count and change volume are large, and the merge required a fair number of conflict resolutions. Substantial testing is therefore needed to ensure that critical functionality and optimizations are not lost.

Customer safeguard: in the event that some functionality breaks or performance regresses as a result of this merge, the gfx11_20260528 tag can be used to obtain stable pre-merge code. This package is based on the May 28 nightly build which showed good numbers and test results during nightly regressions.

Test Plan

  • Run representative subsets of all of the following test suites on a Strix Halo machine pre- and post-merge:

    • tests/basic_correctness/test_basic_correctness.py
    • tests/kernels/core/ (fundamental kernel operations)
    • tests/kernels/moe/test_exllama_moe.py
    • tests/kernels/moe/test_hybrid_w4a16_moe.py
    • tests/kernels/quantization/test_awq_gemv_moe.py
    • tests/kernels/quantization/test_hip_w4a16.py
    • tests/kernels/quantization/test_hybrid_w4a16_triton.py
    • tests/kernels/quantization/test_rocm_compressed_tensors_w4a16.py
    • tests/kernels/quantization/test_rocm_skinny_gemms.py
    • tests/kernels/quantization/test_dynamic_int8_lm_head.py
    • tests/kernels/test_wvsplitk_fused_silu.py
    • tests/kernels/attention/test_rocm_attention_selector.py
    • tests/kernels/attention/test_triton_unified_attention.py
    • tests/kernels/attention/test_cache.py (KV cache operations + reshaping)
    • tests/kernels/attention/test_flash_attn.py
    • tests/kernels/attention/test_paged_attn.py
    • tests/kernels/moe/test_batched_moe.py
    • tests/kernels/moe/test_fused_topk.py
    • tests/quantization/test_hip_w4a16_kernel.py
    • tests/quantization/test_fp8.py
    • tests/quantization/test_compressed_tensors.py
    • tests/samplers/test_beam_search.py
    • tests/samplers/test_ignore_eos.py
    • tests/models/language/generation/test_common.py
    • tests/entrypoints/llm/test_generate.py
    • tests/rocm/aiter/test_grouped_quant.py (grouped FP8 quantization)
    • tests/rocm/aiter/test_mla_fp8_support_check.py (MLA FP8 support)
    • tests/rocm/aiter/test_fused_qk_norm_mrope_kvcache.py
  • Run attention benchmarking tests to check for any regressions

  • Run a sample of models/use cases from nightly gfx1151 benchmarks to check for severe performance regressions pre- and post-merge

  • For any regressions found, plan next steps

Test Result

For all correctness/functionality tests, exactly the same test cases are passing both pre- and post-merge. Out of ~1700 tests, 8 tests failed, all for pre-existing and low-risk reasons such as model gating.
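
The pre/post comparison described above can be sketched as a set difference over passing test IDs. This is a minimal illustration, not the procedure actually used for this PR: it assumes each run was exported as a JUnit XML report (for example with pytest's `--junitxml` option), and the helper names `passed_tests` and `compare_runs` and the sample reports are hypothetical.

```python
import xml.etree.ElementTree as ET


def passed_tests(junit_xml: str) -> set:
    """Return the IDs of test cases that passed in a JUnit XML report string."""
    root = ET.fromstring(junit_xml)
    passed = set()
    # iter() finds <testcase> elements whether the root is <testsuites> or <testsuite>.
    for case in root.iter("testcase"):
        # A case counts as passed when it has no failure, error, or skipped child.
        if not any(child.tag in {"failure", "error", "skipped"} for child in case):
            passed.add(f"{case.get('classname')}::{case.get('name')}")
    return passed


def compare_runs(pre_xml: str, post_xml: str) -> dict:
    """Report tests whose pass status differs between the two runs."""
    pre, post = passed_tests(pre_xml), passed_tests(post_xml)
    return {
        "newly_failing": pre - post,  # passed pre-merge, not post-merge
        "newly_passing": post - pre,  # passed post-merge, not pre-merge
    }


# Tiny made-up reports, for illustration only (not data from this PR).
PRE = """<testsuite>
  <testcase classname="tests.a" name="test_ok"/>
  <testcase classname="tests.a" name="test_gated"><failure message="gated"/></testcase>
</testsuite>"""
POST = PRE  # same outcomes post-merge

if __name__ == "__main__":
    print(compare_runs(PRE, POST))
```

Two empty sets correspond to the result reported here: the same test cases pass before and after the merge, and pre-existing failures stay out of both sets.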

Attention benchmark tests show some prefill regressions of up to 33% (SmolLM2-1.7B-Instruct-AWQ). The average regression for pure prefill cases is ~10.9%. Decode performance is ~3.3% slower on average (range: 1.7% speedup to 9.4% slowdown).
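
For reference, figures like these can be derived from paired pre- and post-merge timings roughly as follows. This is a sketch under assumptions: it treats the benchmark metric as a latency (lower is better), the helper names `slowdown_pct` and `summarize` are hypothetical, and the sample timings are invented for illustration and are not the measurements behind the numbers above.

```python
from statistics import mean


def slowdown_pct(pre: float, post: float) -> float:
    """Percent change in latency; positive means post-merge is slower."""
    return (post - pre) / pre * 100.0


def summarize(cases: dict) -> dict:
    """Aggregate per-case slowdowns into mean, best, and worst values."""
    changes = [slowdown_pct(pre, post) for pre, post in cases.values()]
    return {"mean": mean(changes), "best": min(changes), "worst": max(changes)}


# Hypothetical (pre_ms, post_ms) latencies, not data from this PR.
SAMPLE = {
    "case_a": (100.0, 110.0),  # slower post-merge
    "case_b": (200.0, 190.0),  # faster post-merge
    "case_c": (50.0, 55.0),    # slower post-merge
}

if __name__ == "__main__":
    for key, value in summarize(SAMPLE).items():
        print(f"{key}: {value:+.1f}%")
```

A negative "best" value indicates a speedup, which is how a range such as "1.7% speedup to 9.4% slowdown" would appear in this representation.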

However, across 16 end-to-end tests, most showed little difference in overall TPOT, TTFT, or end-to-end latency compared to pre-merge. A few short-context cases (~128 input tokens) showed a 1-3% TTFT slowdown; with so short a context, the difference is likely due to something other than attention, e.g. GEMM or CPU overhead.

All of these end-to-end benchmarks successfully completed post-merge.

jikunshang and others added 30 commits May 19, 2026 11:17
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
…2537)

Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
…llm-project#42766)

Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
…[2/N] (vllm-project#43039)

Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Signed-off-by: george <george@inferact.ai>
Co-authored-by: george <george@inferact.ai>
…hase A and Phase B (vllm-project#42289)

Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: gemini-code-assist <noreply@google.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vllm-project#42671)

Signed-off-by: junyanxu <junyanxu5513@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Signed-off-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
)

Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
…tion + clear_cache (vllm-project#42117)

Signed-off-by: hao-aaron <ahao@anyscale.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: ZhanqiuHu <zhu@redhat.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
Signed-off-by: Wang Yiwen <121547057+yiwen101@users.noreply.github.com>
…tool parser (vllm-project#43025)

Signed-off-by: sfeng33 <4florafeng@gmail.com>
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
…vllm-project#42080)

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
…-project#42994)

Signed-off-by: Dao Le <Dao007forever@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
…ous layers (vllm-project#42976)

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…r autotune (vllm-project#43119)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
akii96 and others added 13 commits May 27, 2026 16:22
…casts in rotary path (vllm-project#42833)

Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com>
…roject#43550)

Signed-off-by: Aditya Singh <adisin650@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
… & non-streaming paths (vllm-project#43662)

Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
…in-aligned W4A16 shapes (vllm-project#43731)

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com>
…rgs (vllm-project#43401)

Signed-off-by: Ashwin Giridharan <girida@amazon.com>
Signed-off-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: chunyang.wen <chunyang.wen@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
…ORI_INTERNODE_KERNEL (vllm-project#41751)

Signed-off-by: jatseng-ai <jatseng@amd.com>
…continued) (vllm-project#43361)

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
Signed-off-by: Minh Vu <vuhoangminh97@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
@amd-callumm amd-callumm force-pushed the callumm.upstream_merge branch from 51987e3 to 6e4fc54 Compare June 1, 2026 19:11
Signed-off-by: Callum Mitchell <callumm@amd.com>

Signed-off-by:  <callumm@amd.com>
@amd-callumm amd-callumm force-pushed the callumm.upstream_merge branch from 6e4fc54 to 112c8cb Compare June 2, 2026 16:20
@amd-callumm amd-callumm marked this pull request as ready for review June 2, 2026 17:23

@mgehre-amd mgehre-amd left a comment

Looks good, thanks!

Build is failing (OOM for skinny int4 GEMM?), please check

@amd-callumm (Author)

> Build is failing (OOM for skinny int4 GEMM?), please check

Looking into this now.

@amd-callumm (Author)

Reducing MAX_JOBS from 2 -> 1 in CI seemed to do the trick, at the cost of somewhat slower build times. I'll merge this if the rest of CI looks good.
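
As a rough sketch of what this change amounts to: `MAX_JOBS` is the environment variable that caps the number of parallel compile jobs in a vLLM source build, so lowering it trades build speed for lower peak memory. The CI configuration itself is not shown in this PR, so the `pip install -e .` command and the `build_invocation` helper below are assumptions for illustration only.

```python
import os
import sys
from typing import Dict, List, Optional, Tuple


def build_invocation(
    max_jobs: int, base_env: Optional[Dict[str, str]] = None
) -> Tuple[List[str], Dict[str, str]]:
    """Return (command, env) for a source build with capped compile parallelism."""
    if max_jobs < 1:
        raise ValueError("max_jobs must be at least 1")
    # Copy the environment so the caller's mapping is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    env["MAX_JOBS"] = str(max_jobs)
    # Assumed build command; the real CI job may invoke the build differently.
    cmd = [sys.executable, "-m", "pip", "install", "-e", "."]
    return cmd, env


if __name__ == "__main__":
    cmd, env = build_invocation(1)
    print("MAX_JOBS =", env["MAX_JOBS"])
    print("command  =", " ".join(cmd))
    # To actually run the build from the repository root:
    #   import subprocess; subprocess.run(cmd, env=env, check=True)
```

With one job, only a single compiler process is resident at a time, which is consistent with the OOM disappearing at the cost of a slower build.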

@amd-callumm amd-callumm merged commit cdd11a6 into gfx11 Jun 2, 2026
4 of 5 checks passed
@eble-amd

eble-amd commented Jun 3, 2026


@amd-callumm The performance test job failed on runner 'linux-strix-halo-gpu-rocm-8-gpu0-1780422175'. I don't pay attention to every job, but every time I've checked, they have failed on runners 6 and 7 and passed on runners 8 and 9. This failure on 8 deserves some follow-up.
