[Quark] Support loading Quark NVFP4 checkpoints in vLLM by fxmarty-amd · Pull Request #35859 · vllm-project/vllm

fxmarty-amd · 2026-03-03T11:52:04Z

Purpose

The Quark library (https://github.com/amd/Quark/) has experimental NVFP4 support that will be extended in future releases. This PR enables vLLM to load NVFP4 models (dense and MoE) quantized with Quark.

Todo:

Port the parallel layer scale recomputation logic [won't do: an error is raised instead if the weight global scales of the q/k/v projections, or of the gate/up projections, are not equal].
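
For illustration only, the equality check described in the todo item could look roughly like the sketch below. It is not the code from this PR: the function name `assert_equal_global_scales`, the shard names, and the plain-float representation of the scales are all assumptions made to keep the example self-contained. The idea is that shards fused into one parallel layer (q/k/v, or gate/up) must share a single weight global scale, and loading fails loudly when they do not, rather than recomputing the scales.

```python
def assert_equal_global_scales(layer_name, shard_scales):
    """Return the shared weight global scale of a fused layer.

    `shard_scales` maps a shard name (e.g. "q_proj") to its weight global
    scale. Hypothetical helper: names and types are illustrative only.
    """
    if not shard_scales:
        raise ValueError(f"{layer_name}: no shard scales were provided")

    distinct = set(shard_scales.values())
    if len(distinct) > 1:
        # No scale recomputation is attempted; the mismatch is reported.
        raise ValueError(
            f"{layer_name}: weight global scales differ across shards "
            f"({shard_scales}); recomputation is not supported"
        )
    return next(iter(distinct))


# Usage: shards that agree yield the single shared scale.
qkv_scale = assert_equal_global_scales(
    "qkv_proj", {"q_proj": 0.25, "k_proj": 0.25, "v_proj": 0.25}
)
```

Raising instead of recomputing keeps the loading path simple, at the cost of rejecting checkpoints whose fused projections were quantized with different global scales.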

Test Plan

pytest tests/quantization/test_quark.py -s -vvvvv -k "test_nvfp4_wikitext_correctness"

Signed-off-by: Felix Marty <Felix.Marty@amd.com>

gemini-code-assist

Code Review

This pull request adds support for loading Quark NVFP4 checkpoints in vLLM, including an emulation path for hardware that doesn't natively support NVFP4. The changes are extensive, touching configuration, quantization layers, and tests. A significant part of the work involves refactoring to accommodate the new emulation backend for both dense and MoE layers. While the overall approach is sound, I've identified a critical issue in the handling of quantization scales for the new QuarkNVFP4 scheme which could lead to incorrect model outputs.
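
As a rough illustration of what an emulation path has to do, the sketch below dequantizes one NVFP4 block in plain Python. It is not the implementation from this PR. The E2M1 value table is the standard 4-bit float encoding (1 sign bit, 2 exponent bits, 1 mantissa bit); everything else is an assumption for the example: the nibble order (low nibble first), the scale convention (value = fp4 * block_scale * global_scale), and all function names.

```python
# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # NVFP4 shares one block scale across 16 elements


def decode_e2m1(code):
    """Decode a single 4-bit E2M1 code into a Python float."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0b0111]


def unpack_bytes(packed):
    """Unpack bytes holding two FP4 codes each.

    Assumption: the low nibble stores the first element of each pair.
    """
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F))
        values.append(decode_e2m1(byte >> 4))
    return values


def dequantize_block(packed, block_scale, global_scale):
    """Dequantize one 16-element block.

    Assumption: the real value is fp4 * block_scale * global_scale; some
    checkpoints store the reciprocal of the global scale instead.
    """
    if len(packed) * 2 != BLOCK_SIZE:
        raise ValueError("expected 8 packed bytes per 16-element block")
    return [v * block_scale * global_scale for v in unpack_bytes(packed)]


# Usage: 8 bytes of 0x21 decode to the pair (0.5, 1.0) repeated 8 times.
block = dequantize_block([0x21] * 8, block_scale=2.0, global_scale=0.5)
```

A lookup table is used here because E2M1 has only 16 possible codes, so decoding reduces to an index operation followed by two multiplications.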

fxmarty-amd · 2026-05-04T12:40:42Z

        # Move the E2M1 lookup table to the device now, because
        # `.to(device)` is not allowed during CUDA graph capture.
-        kE2M1ToFloat_handle.val = kE2M1ToFloat_handle.val.to(layer.weight.device)
+        kE2M1ToFloat_handle.val = kE2M1ToFloat_handle.val.to(layer.w13_weight.device)


Typo from #40033 - surprised it slipped in.

fxmarty-amd · 2026-05-04T12:42:15Z

Hi @kylesayrs @mgoin, this PR should be in a good state. I'd appreciate it if you could have another look, thank you!

mgoin

Looks straightforward, thanks for adopting the kernel/moe interface!

mergify · 2026-05-04T20:15:51Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @fxmarty-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

…sk for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' Signed-off-by: Felix Marty <Felix.Marty@amd.com>

fxmarty-amd · 2026-05-06T12:21:20Z

Getting a seemingly unrelated CI failure https://buildkite.com/vllm/ci/builds/64486#019df8a9-125b-43c1-af7f-679765bfef60:

#23 76.57 /opt/rocm/lib/llvm/bin/clang++  -DCUTLASS_ENABLE_DIRECT_CUDA_DRIVER_CALL=1 -DHIPBLASLT_USE_ROCROLLER -DPy_LIMITED_API=3 -DTORCH_EXTENSION_NAME=_C -DUSE_C10D_GLOO -DUSE_C10D_MPI -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_PROF_API=1 -DUSE_RPC -DUSE_TENSORPIPE -D_C_EXPORTS -D__HIP_PLATFORM_AMD__ -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -I/app/vllm/build/temp.linux-x86_64-cpython-312/csrc -isystem /usr/include/python3.12 -isystem /usr/local/lib/python3.12/dist-packages/torch/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -isystem /opt/rocm-7.2.2/include/hiprand -isystem /opt/rocm-7.2.2/include/rocrand -Wno-unused-result -Wno-unused-value -O2 -g -DNDEBUG -std=gnu++17 --offload-arch=gfx90a --offload-arch=gfx942 --offload-arch=gfx950 -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -fPIC -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -DUSE_ROCM -DENABLE_FP8 -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF_OPERATORS__ -Werror=unused-variable -fno-gpu-rdc -DTORCH_HIP_VERSION=702 -Wno-shift-count-negative -Wno-shift-count-overflow -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -std=c++17 -DHIP_ENABLE_WARP_SYNC_BUILTINS -DHIPBLASLT_OUTER_VEC -DUSE_ROCM_CK_GEMM -MD -MT CMakeFiles/_C.dir/csrc/cache_kernels.hip.o -MF CMakeFiles/_C.dir/csrc/cache_kernels.hip.o.d -o CMakeFiles/_C.dir/csrc/cache_kernels.hip.o -x hip -c /app/vllm/build/temp.linux-x86_64-cpython-312/csrc/cache_kernels.hip
#23 76.57 In file included from /app/vllm/build/temp.linux-x86_64-cpython-312/csrc/cache_kernels.hip:13:
#23 76.57 /app/vllm/build/temp.linux-x86_64-cpython-312/csrc/concat_mla_q.cuh:4:10: fatal error: 'cuda_bf16.h' file not found
#23 76.57     4 | #include <cuda_bf16.h>
#23 76.57       |          ^~~~~~~~~~~~~
#23 76.57 1 error generated when compiling for gfx90a.

…#35859) Signed-off-by: Felix Marty <Felix.Marty@amd.com> Signed-off-by: fxmarty-amd <felmarty@amd.com> Co-authored-by: Kyle Sayers <kylesayrs@gmail.com>

…#35859) Signed-off-by: Felix Marty <Felix.Marty@amd.com> Signed-off-by: fxmarty-amd <felmarty@amd.com> Co-authored-by: Kyle Sayers <kylesayrs@gmail.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

fxmarty-amd added 17 commits March 2, 2026 12:16

fix issues with nvfp4 dense emulation in vllm (squash)

b313689

Signed-off-by: Felix Marty <Felix.Marty@amd.com>

address comments

bc6ff39

Signed-off-by: Felix Marty <Felix.Marty@amd.com>

nvfp4 moe emulation support

14bc668

Signed-off-by: Felix Marty <Felix.Marty@amd.com>

Merge branch 'upstream-nvfp4-simulation-support-rocm' into upstream-n…

a11d131

…vfp4-simulation-support-moe

wip use TritonExperts

95c6a4a

Signed-off-by: Felix Marty <Felix.Marty@amd.com>

wip cleanup

5a2cf8c

Signed-off-by: Felix Marty <Felix.Marty@amd.com>

wip cleanup

0ea8f82

Signed-off-by: Felix Marty <Felix.Marty@amd.com>

wip cleanup

d99373e

Signed-off-by: Felix Marty <Felix.Marty@amd.com>

fix activation quantization

7a5f2ba

Signed-off-by: Felix Marty <Felix.Marty@amd.com>

address comment

457f9df

Signed-off-by: Felix Marty <Felix.Marty@amd.com>

aot weight dequantization

86d6316

Signed-off-by: Felix Marty <Felix.Marty@amd.com>

use emulation_dequantize_weights for quark OCP MX as well

2cb040b

Signed-off-by: Felix Marty <Felix.Marty@amd.com>

tiny fix

7a67180

Signed-off-by: Felix Marty <Felix.Marty@amd.com>

enable test on non-blackwell devices

01b4dce

Signed-off-by: Felix Marty <Felix.Marty@amd.com>

Merge branch 'upstream-nvfp4-simulation-support-moe' into upstream-nv…

aef916d

…fp4-simulation-aot-weight-dequantization

add test

c4aff81

Signed-off-by: Felix Marty <Felix.Marty@amd.com>

add test

4710a00

Signed-off-by: Felix Marty <Felix.Marty@amd.com>

mergify Bot added the rocm (Related to AMD ROCm) label Mar 3, 2026

github-project-automation Bot added this to AMD Mar 3, 2026

github-project-automation Bot moved this to Todo in AMD Mar 3, 2026

gemini-code-assist Bot reviewed Mar 3, 2026

Comment thread vllm/model_executor/layers/quantization/quark/schemes/quark_nvfp4.py Outdated

support quark dense and moe nvfp4

affdda7

Signed-off-by: Felix Marty <Felix.Marty@amd.com>

fxmarty-amd force-pushed the upstream-nvfp4-simulated-quark branch from 6e11ec3 to affdda7 March 3, 2026 12:03

fxmarty-amd added 2 commits March 3, 2026 08:13

wip cleanup

da111bd

Signed-off-by: Felix Marty <Felix.Marty@amd.com>

bug fixes and add test

0cc4207

Signed-off-by: Felix Marty <Felix.Marty@amd.com>

fxmarty-amd marked this pull request as ready for review March 3, 2026 15:42

fxmarty-amd requested review from mgoin, robertgshaw2-redhat, tjtanaa and yewentao256 as code owners March 3, 2026 15:42

fix typo

a1814ad

Signed-off-by: Felix Marty <Felix.Marty@amd.com>

fxmarty-amd commented May 4, 2026

fxmarty-amd requested a review from kylesayrs May 4, 2026 12:41

mgoin approved these changes May 4, 2026

github-project-automation Bot moved this to Ready in NVIDIA May 4, 2026

mgoin added the ready (ONLY add when PR is ready to merge/full CI is needed) and quantization labels May 4, 2026

mergify Bot added the needs-rebase label May 4, 2026

Merge branch 'main' into upstream-nvfp4-simulated-quark

41be38d

Signed-off-by: Felix Marty <Felix.Marty@amd.com>

mergify Bot removed the needs-rebase label May 5, 2026

fxmarty-amd added 2 commits May 5, 2026 07:12

remove outdated comments

4c729d3

Signed-off-by: Felix Marty <Felix.Marty@amd.com>

fxmarty-amd added 2 commits May 11, 2026 09:58

Merge branch 'main' into upstream-nvfp4-simulated-quark

199488a

Merge branch 'main' into upstream-nvfp4-simulated-quark

3b28683

mgoin approved these changes May 13, 2026

vllm-bot merged commit 4033096 into vllm-project:main May 13, 2026
69 of 71 checks passed

github-project-automation Bot moved this from Ready to Done in NVIDIA May 13, 2026

github-project-automation Bot moved this from Todo to Done in AMD May 13, 2026

yewentao256 mentioned this pull request May 13, 2026

[CI] Fix pre-commit issue #42563

Merged

cnie-rblx mentioned this pull request May 13, 2026

[Quant] Consolidate GPTQ: rename gptq_marlin.py to auto_gptq.py #38288

Merged

shantipriya-amd mentioned this pull request May 26, 2026

[Bugfix] Add deepseek_v32 to Quark dynamic MXFP4 model type check #39498

Merged

vllm-bot merged 112 commits into vllm-project:main from fxmarty-amd:upstream-nvfp4-simulated-quark

5 participants
