
[Quark] Support loading Quark NVFP4 checkpoints in vLLM #35859

Merged
vllm-bot merged 112 commits into vllm-project:main from fxmarty-amd:upstream-nvfp4-simulated-quark on May 13, 2026

@fxmarty-amd fxmarty-amd commented Mar 3, 2026

Contributor

Purpose

Quark (https://github.com/amd/Quark/) has experimental NVFP4 support that will be extended in future releases. This PR enables vLLM to load NVFP4 models (dense and MoE) quantized with the Quark library.

Todo:

  • Port the parallel-layer scale recomputation logic. [Won't do: an error is raised instead when the weight global scales of the q/k/v projections, or of the gate/up projections, are not equal.]
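
The choice described in the Todo item (fail loudly rather than recompute scales) could be sketched roughly as follows. This is an illustration only, not the code from this PR: `check_fused_global_scales` and the shard names are hypothetical, and the scales are plain floats instead of tensors.

```python
import math


def check_fused_global_scales(scales: dict[str, float], rel_tol: float = 0.0) -> float:
    """Return the single global scale shared by shards that get fused together.

    `scales` maps a (hypothetical) shard name such as "q_proj" to its weight
    global scale. A fused layer can only carry one global scale, so if the
    shards disagree we raise instead of recomputing the block scales.
    """
    if not scales:
        raise ValueError("no shards given")
    names = list(scales)
    reference = scales[names[0]]
    for name in names[1:]:
        # With rel_tol=0.0 and abs_tol=0.0 this is an exact equality check.
        if not math.isclose(scales[name], reference, rel_tol=rel_tol, abs_tol=0.0):
            raise ValueError(
                f"weight global scale of {name!r} ({scales[name]}) differs from "
                f"{names[0]!r} ({reference}); fused layers need equal scales"
            )
    return reference


# Equal scales pass and the shared value is returned.
shared = check_fused_global_scales({"q_proj": 0.5, "k_proj": 0.5, "v_proj": 0.5})
```

Calling the same helper with `{"gate_proj": 0.5, "up_proj": 0.25}` raises `ValueError`, which mirrors the "raise an error" behaviour described in the Todo item.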

Test Plan

pytest tests/quantization/test_quark.py -s -vvvvv -k "test_nvfp4_wikitext_correctness"

Signed-off-by: Felix Marty <Felix.Marty@amd.com>
@mergify mergify Bot added the rocm Related to AMD ROCm label Mar 3, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Mar 3, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request adds support for loading Quark NVFP4 checkpoints in vLLM, including an emulation path for hardware that doesn't natively support NVFP4. The changes are extensive, touching configuration, quantization layers, and tests. A significant part of the work involves refactoring to accommodate the new emulation backend for both dense and MoE layers. While the overall approach is sound, I've identified a critical issue in the handling of quantization scales for the new QuarkNVFP4 scheme which could lead to incorrect model outputs.
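
To make the emulation path mentioned in the review more concrete, here is a minimal sketch of NVFP4 dequantization in plain Python. It is not the PR's implementation. It assumes the usual NVFP4 layout (4-bit E2M1 values, one scale per block of 16 elements, plus a per-tensor global scale), assumes the low nibble of each byte holds the first element, and assumes the global scale is applied by multiplication; a given checkpoint may store the reciprocal instead. All function names are hypothetical.

```python
# Magnitudes representable by FP4 E2M1, indexed by the low 3 bits of the code.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # one block scale per 16 consecutive elements


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 index the table."""
    magnitude = E2M1_MAGNITUDES[code & 0b0111]
    return -magnitude if code & 0b1000 else magnitude


def unpack_nibbles(packed: bytes) -> list[int]:
    """Split each byte into two 4-bit codes (assumption: low nibble first)."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes


def dequantize_nvfp4(
    packed: bytes, block_scales: list[float], global_scale: float
) -> list[float]:
    """Dequantize packed FP4 data to floats: value * block scale * global scale."""
    codes = unpack_nibbles(packed)
    if len(codes) != BLOCK_SIZE * len(block_scales):
        raise ValueError("expected exactly one block scale per 16 elements")
    return [
        decode_e2m1(code) * block_scales[i // BLOCK_SIZE] * global_scale
        for i, code in enumerate(codes)
    ]


# One block of 16 elements: every byte packs code 1 (0.5) and code 2 (1.0).
values = dequantize_nvfp4(bytes([0x21] * 8), block_scales=[2.0], global_scale=0.5)
```

An emulation backend would then run the matrix multiplication in a higher precision on the dequantized values, which is why it works on hardware without native NVFP4 support, at the cost of speed.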

Comment thread vllm/model_executor/layers/quantization/quark/schemes/quark_nvfp4.py Outdated
@fxmarty-amd fxmarty-amd force-pushed the upstream-nvfp4-simulated-quark branch from 6e11ec3 to affdda7 Compare March 3, 2026 12:03
@fxmarty-amd fxmarty-amd marked this pull request as ready for review March 3, 2026 15:42
  # Move the E2M1 lookup table to the device now, because
  # `.to(device)` is not allowed during CUDA graph capture.
- kE2M1ToFloat_handle.val = kE2M1ToFloat_handle.val.to(layer.weight.device)
+ kE2M1ToFloat_handle.val = kE2M1ToFloat_handle.val.to(layer.w13_weight.device)

Contributor Author

Typo from #40033 - surprised it slipped in.

@fxmarty-amd fxmarty-amd requested a review from kylesayrs May 4, 2026 12:41
@fxmarty-amd

Contributor Author

Hi @kylesayrs @mgoin, this PR should be in good shape now. I would appreciate it if you could take another look. Thank you!

@mgoin mgoin left a comment

Member

Looks straightforward, thanks for adopting the kernel/moe interface!

@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA May 4, 2026
@mgoin mgoin added the ready (ONLY add when PR is ready to merge/full CI is needed) and quantization labels May 4, 2026
@mergify

mergify Bot commented May 4, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @fxmarty-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 4, 2026
@mergify mergify Bot removed the needs-rebase label May 5, 2026
…sk for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'

@fxmarty-amd

Contributor Author

Getting a seemingly unrelated CI failure https://buildkite.com/vllm/ci/builds/64486#019df8a9-125b-43c1-af7f-679765bfef60:

#23 76.57 /opt/rocm/lib/llvm/bin/clang++  -DCUTLASS_ENABLE_DIRECT_CUDA_DRIVER_CALL=1 -DHIPBLASLT_USE_ROCROLLER -DPy_LIMITED_API=3 -DTORCH_EXTENSION_NAME=_C -DUSE_C10D_GLOO -DUSE_C10D_MPI -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_PROF_API=1 -DUSE_RPC -DUSE_TENSORPIPE -D_C_EXPORTS -D__HIP_PLATFORM_AMD__ -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -I/app/vllm/build/temp.linux-x86_64-cpython-312/csrc -isystem /usr/include/python3.12 -isystem /usr/local/lib/python3.12/dist-packages/torch/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -isystem /opt/rocm-7.2.2/include/hiprand -isystem /opt/rocm-7.2.2/include/rocrand -Wno-unused-result -Wno-unused-value -O2 -g -DNDEBUG -std=gnu++17 --offload-arch=gfx90a --offload-arch=gfx942 --offload-arch=gfx950 -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -fPIC -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -DUSE_ROCM -DENABLE_FP8 -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF_OPERATORS__ -Werror=unused-variable -fno-gpu-rdc -DTORCH_HIP_VERSION=702 -Wno-shift-count-negative -Wno-shift-count-overflow -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -std=c++17 -DHIP_ENABLE_WARP_SYNC_BUILTINS -DHIPBLASLT_OUTER_VEC -DUSE_ROCM_CK_GEMM -MD -MT CMakeFiles/_C.dir/csrc/cache_kernels.hip.o -MF CMakeFiles/_C.dir/csrc/cache_kernels.hip.o.d -o CMakeFiles/_C.dir/csrc/cache_kernels.hip.o -x hip -c /app/vllm/build/temp.linux-x86_64-cpython-312/csrc/cache_kernels.hip
#23 76.57 In file included from /app/vllm/build/temp.linux-x86_64-cpython-312/csrc/cache_kernels.hip:13:
#23 76.57 /app/vllm/build/temp.linux-x86_64-cpython-312/csrc/concat_mla_q.cuh:4:10: fatal error: 'cuda_bf16.h' file not found
#23 76.57     4 | #include <cuda_bf16.h>
#23 76.57       |          ^~~~~~~~~~~~~
#23 76.57 1 error generated when compiling for gfx90a.

@vllm-bot vllm-bot merged commit 4033096 into vllm-project:main May 13, 2026
69 of 71 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA May 13, 2026
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD May 13, 2026
mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026
…#35859)

Signed-off-by: Felix Marty <Felix.Marty@amd.com>
Signed-off-by: fxmarty-amd <felmarty@amd.com>
Co-authored-by: Kyle Sayers <kylesayrs@gmail.com>

Labels

nvidia, quantization, ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm)

Projects

Status: Done
Status: Done


5 participants