Add support for ModelOpt MXFP8 dense models #33786
vllm-bot merged 9 commits into vllm-project:main
Conversation
Documentation preview: https://vllm--33786.org.readthedocs.build/en/33786/
Code Review
This pull request adds support for ModelOpt MXFP8 models by introducing a new quantization configuration and associated linear method. The changes are well-structured and add valuable new functionality. My review includes a few points of feedback regarding documentation accuracy, consistency in MoE support, and an opportunity to refactor for code clarity and reuse.
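For context on the format itself: in the OCP Microscaling (MX) formats, MXFP8 stores FP8 elements in blocks of 32 that share one E8M0 power-of-two scale. The sketch below only illustrates that block-scaling idea in plain Python. It is not the code from this PR; the helper names `decode_e8m0` and `dequantize_mx_row` are made up for illustration, and the elements are assumed to be already decoded from FP8 into Python floats.

```python
import math

MX_BLOCK_SIZE = 32  # elements that share one scale in the MX formats
E8M0_BIAS = 127     # exponent bias of the 8-bit E8M0 scale
E8M0_NAN = 0xFF     # reserved encoding for NaN


def decode_e8m0(scale_byte: int) -> float:
    """Decode one E8M0 scale byte into a power-of-two float (hypothetical helper)."""
    if not 0 <= scale_byte <= 0xFF:
        raise ValueError("E8M0 scale must fit in one byte")
    if scale_byte == E8M0_NAN:
        return math.nan
    return 2.0 ** (scale_byte - E8M0_BIAS)


def dequantize_mx_row(elements: list[float], scales: list[int]) -> list[float]:
    """Multiply each block of 32 elements by its shared scale (hypothetical helper).

    Assumption: `elements` holds FP8 values that were already decoded to floats.
    """
    if len(elements) != len(scales) * MX_BLOCK_SIZE:
        raise ValueError("expected exactly one scale per block of 32 elements")
    out: list[float] = []
    for block_idx, scale_byte in enumerate(scales):
        scale = decode_e8m0(scale_byte)
        start = block_idx * MX_BLOCK_SIZE
        out.extend(v * scale for v in elements[start:start + MX_BLOCK_SIZE])
    return out
```

A dequantize-then-matmul path like this is the simplest way to check a fused kernel against a reference, which is presumably why a plain fallback is useful for debugging.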
6d66823 to 0e1bb9f
        "`pip install flashinfer`"
    ) from err

class Mxfp8Backend(Enum):
    TORCH = "torch"
The "torch" backend is temporary (it can be kept for debugging in the future).
A FLASHINFER_CUTLASS backend will be added later; see flashinfer-ai/flashinfer#2464.
It can be added once that FlashInfer PR is merged and the FlashInfer version is bumped in vLLM.
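To make the plan above concrete, here is a minimal sketch of how a second backend could sit next to `TORCH`. Only `Mxfp8Backend` and `TORCH = "torch"` come from the diff; the `FLASHINFER_CUTLASS` member value, the `select_mxfp8_backend` helper, and its `flashinfer_kernel_available` flag are assumptions for illustration and are not part of this PR.

```python
from enum import Enum


class Mxfp8Backend(Enum):
    TORCH = "torch"
    # Assumption: planned follow-up backend; the string value is a guess.
    FLASHINFER_CUTLASS = "flashinfer_cutlass"


def select_mxfp8_backend(flashinfer_kernel_available: bool) -> Mxfp8Backend:
    """Pick the fused kernel when present, else the torch fallback.

    Hypothetical helper: how availability is detected is left to the caller.
    """
    if flashinfer_kernel_available:
        return Mxfp8Backend.FLASHINFER_CUTLASS
    return Mxfp8Backend.TORCH
```

Keeping `TORCH` as the fallback means the selection logic would not need to change when the FlashInfer kernel lands; only the availability check would.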
0e1bb9f to 40fac14
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
a42165d to 99c68e2
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
459ff5a to 8267a99
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Head branch was pushed to by a user without write access
Do we have a plan to support an MXFP8 GEMM kernel?
Yes, I mentioned that in a previous comment; see this FlashInfer PR:
* Implement zero-copy GQA for multimodal and CPU (#33732)
Signed-off-by: Taeksang Kim <ts.kim@hyperaccel.ai>
* [Bugfix] Support `RotaryEmbedding` CustomOp for gpt-oss (#33800)
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
* [Model] Add transcription support for Qwen3-Omni (#29828)
Signed-off-by: Muhammad Hashmi <mhashmi@berkeley.edu>
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: NickLucche <nlucches@redhat.com>
* Revert "[torch.compile] Significantly speed up cold start times" (#33820)
Signed-off-by: Richard Zou <zou3519@gmail.com>
* Change the type signature of MixtureOfExperts.expert_weights to MutableSequence[Sequence[Tensor]] (#33573)
Signed-off-by: Sage Moore <sagmoore@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
* [Core] Don't schedule spec tokens with prefill chunks (#33652)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* feat: Add ColBERT late interaction model support (#33686)
Signed-off-by: Ilya Boytsov <ilyaboytsov1805@gmail.com>
Signed-off-by: Ilya Boytsov <boytsovpanamera@mail.ru>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
* [CI][torch.compile] Reduce e2e fusion test time (#33293)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: ProExpertProg <luka.govedic@gmail.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
* [Bugfix] Disable TRTLLM attention when KV transfer is enabled (#33192)
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
* [Bugfix] fix DeepSeek R1 with CUTLASS MLA Broken on B200 (#33637)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
* [release] Minor fixes to release annotation (#33849)
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
* [CI][Bugfix]: return McpCall for built-in MCP tools in non-streaming mode (#32762)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* Revert "[Attention][FA3] Update FA3 to include new swizzle optimization" (#33841)
* [Minor] Include `StreamingInput` in inputs package (#33856)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* [docs] fix unintentional misspellings (#33863)
Signed-off-by: rinbaro <ilgomishra@gmail.com>
* [CI][AMD][BugFix] Ensure VLLM_ROCM_USE_AITER is set so test_rocm_aiter_topk.py can run correctly (#33840)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
* [2/N] move responses/serving _make_response_output_items logic to parser (#33281)
Signed-off-by: Andrew Xia <axia@fb.com>
Signed-off-by: Andrew Xia <axia@meta.com>
Co-authored-by: Andrew Xia <axia@fb.com>
* [CI/Build] Parallelize CPU CI tests (#33778)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
* [Bugfix] Fix ScoreMultiModalParam multi-document scoring returning single result (#33837)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
* [CPU][BugFix] Allow w8a8 oneDNN quantized matmul to support 3D inputs (#33727)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
* [CI/Build] Fix CPU CI test case title (#33870)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
* [Perf] Optimize the performance of structured output + reasoning (#33557)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
* [KV Connector][Metrics] Do not count local prefix cache hits in connector queries (#30522)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
* [Bugfix] Kimi-K2 grouped_topk usage for Flashinfer monolithic kernels. (#33858)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
* [Refactor] Move `task` outside of `PoolingParams.verify` (#33796)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
* [ROCm][Bugfix][CI] Fix hybrid models and their tests (Mamba/Jamba/Bamba) (#32710)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
* Enable Cross layers KV cache layout at NIXL Connector V2 (#33339)
Signed-off-by: Liran Schour <lirans@il.ibm.com>
Signed-off-by: liranschour <liranschour@users.noreply.github.com>
Co-authored-by: Or Ozeri <or@ozery.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
* [perf] Integrate flashinfer concat_mla_k (#31171)
* [Bugfix] Fix Kimi-K2.5 NVFP4 checkpoints weight loading (#33876)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
* [Refactor] Clean up input preprocessing (#33687)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Bugfix] Fix corner case of sparse embedding (#33886)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
* [Docs] Add bart-plugin to docs (#33905)
Signed-off-by: NickLucche <nlucches@redhat.com>
* [Bugfix] Fix step3p5 parser when using mtp (#33690)
Signed-off-by: mariohong <mariohong128@gmail.com>
* [Feat][RL][1/2] Native Weight Syncing API: NCCL (#31943)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: Aaron Hao <ahao@anyscale.com>
Co-authored-by: SumanthRH <sumanthrh99@gmail.com>
* [BugFix] Fix LoRA Fp8 (#33879)
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
* [Spec Decode] Unified Parallel Drafting (#32887)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
* [Misc] Add debug logs (#33931)
Signed-off-by: NickLucche <nlucches@redhat.com>
* [Bugfix] Fix swapped engine_ids in NIXL Llama 4 local attention path (#33795)
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
* [Moe Refactor] Make Inplace Flag for FusedMoEModularKernel part of the constructor (#33375)
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
* [Models] Consolidate Deepseek-OCR2 processor (#33909)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
* [Bugfix] Suppress non-TTY color output on the process name part of the log (#29714)
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>
* Fix tokenizer test for renamed attr on Transformers v5 (#33902)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Misc] Rename `translations` to `speech_to_text` for OAI serving component (#33904)
Signed-off-by: NickLucche <nlucches@redhat.com>
* [Bugfix] Fix DSV3.2 NVFP4 (#33932)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
* [Bugfix] Make MM batching more robust (#33817)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Minor] Sort safetensors files to ensure deterministic loading order (#33491)
Signed-off-by: Lihao Ran <imlihao.ran@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
* Adds padding and perf improvements to wvSplitK_fp8 (#33527)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
* [Bugfix] Fix DeepSeek v3.2 tokenizer outputting None issue (#33832)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
* [Feature] OTEL tracing during loading (#31162)
* [Perf] Disable clean_logits in deepgemm fp8_mqa_logits kernel (#33568)
* [Docs] Add reo analytics (#33957)
Signed-off-by: simon-mo <simon.mo@hey.com>
* fix(ROCm): Make flash_attn import optional in MLA attention (#33511)
Signed-off-by: rabi <ramishra@redhat.com>
* feat(frontend): early-fail tokenization guard for user requests (#31366)
Signed-off-by: limingliang <limingliang@stepfun.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: limingliang <limingliang@stepfun.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Misc] Update code for encoder-decoder models (#33900)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [CPU] Add BF16 Kernel type for s390x (#33788)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
* [XPU][4/N] add mxfp4 moe model support (#33679)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
* [XPU]Replace pip in docker.xpu with uv pip (#31112)
Signed-off-by: sihao.li <sihao.li@intel.com>
* Onboard voyage-4-nano (#33720)
Signed-off-by: Chengcheng Pei <chengchengpei@outlook.com>
Signed-off-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com>
Co-authored-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [cpu][performance] CPU Paged Attention NEON BFMMLA BF16 Implementation (#32263)
Signed-off-by: Gassan <gassan.salama@arm.com>
* Fix `main` pre-commit (#33975)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* support view_from_cpu_tensor on XPU (#33868)
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
* Consolidate and fix forbidden import `pre-commit` checks (#33982)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [PaddleOCR-VL] Add BC for transformers 5.0 config (#33976)
Signed-off-by: zhangyue66 <zhangyue66@baidu.com>
* Bump HF Hub client to get bug fix (#33984)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [CPU][BugFix] Fix loading of w8a8int models with bias (#33582)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
* [torch.compile] Reorganize vllm/compilation and tests/compile (0/N for vLLM IR) (#33731)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: ProExpertProg <luka.govedic@gmail.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
* [Bugfix][Model] Support LoRA on Qwen3 Output Embedding (#29816)
Signed-off-by: kurt <kurt@thinkingmachines.ai>
* [Docs] Improve documentation (#33799)
Co-authored-by: Soren Dreano <soren@numind.ai>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
* Update `WeightTransferConfig` to be more standard like the others (#33989)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Bugfix] Fix models and tests for transformers v5 (#33977)
Signed-off-by: raushan <raushan@huggingface.co>
Signed-off-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [FIX] guidance: use max(vocab_size, len(tokenizer)) for n_vocab (#33509)
Signed-off-by: Frederic Odermatt <frederic.odermatt@44ai.ch>
* [ROCm][AITER] Fix AITER import regression for explicit backend selection (#33749)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [Docs] Add sections on process architecture and minimum CPU resources (#33940)
It seems users can be confused about vLLM's performance when running
with very small amounts of CPU cores available. We are missing a clear
overview of what vLLM's process architecture is, so I added this along with
some diagrams in arch_overview.md, and included a section on CPU resource
recommendations in optimization.md
Signed-off-by: mgoin <mgoin64@gmail.com>
* [Model] Support MiniCPM-o 4.5 (#33431)
Signed-off-by: caitianchi <caitianchi@modelbest.cn>
Signed-off-by: tc-mb <caitianchi@modelbest.cn>
Co-authored-by: mslv <mslv@baai.ac.cn>
* [Refactor] Consolidate sequence normalization and enc-dec parsing (#33928)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [XPU][5/N] add wna16 xpu kernel (#33973)
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
* [Docs] Update link to Benchmark CLI documentation (#33254)
Signed-off-by: Eldar Kurtić <8884008+eldarkurtic@users.noreply.github.com>
* [Bugfix] Fix the issue where tool calling does not work when using fast detokenization with dsv32 (#33964)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
* [Log] Optimize duplicate startup log (#33944)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
* [KV Connector] Add missing method overrides to MultiConnector (#33292)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
* [DOC] [ROCm] Update docker deployment doc (#33971)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Model Runner V2] support apply penalty for spec decode (#33251)
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
* [Refactor] Remove align block size logic in `moe_permute` (#33449)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
* [Rocm][Bugfix] Fix dtype not same for gemm_a4w4 op (#33734)
Signed-off-by: charlifu <charlifu@amd.com>
* [Bugfix] Fix no attribute error of SharedFusedMoE (DeepSeek-V3.1 as test model) (#33993)
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
* [Fix] Fix `logprobs=0` handling for `/inference/v1/generate` endpoint (#34010)
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
* Fix RoutingMethodType logic (#33919)
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
* [bugfix] [ROCm] Fix premature CUDA initialization in platform detection (#33941)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [Feat][RL] Pause and Resume with keep requests for single engine (#32351)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: Aaron Hao <ahao@anyscale.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
* [Bugfix] Fix QK Norm+RoPE fusion pattern matching on B200+FP8 (#33967)
Signed-off-by: Ikenna <ikennachifo@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
* [Bugfix] Fix Whisper tokenization (#34011)
Signed-off-by: NickLucche <nlucches@redhat.com>
* [CI][AMD]Bugfix] Check that model_config is not None in enable_norm_pad_fusion (#34007)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
* [Bugfix] Fix _fused_moe_lora_expand signature mismatch (#33821)
Signed-off-by: Xin Yang <xyangx@amazon.com>
* [Misc] Add backward-compatible import aliases for renamed translations module (#34015)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* [ModelRunner V2] Revert token rank comparison difference for now (#34017)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* fix description in plugin_system.md (#33999)
* [Revert] Add util `handle_deprecated` back (#33998)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
* [Kernel] Add enable_sm120_or_later for SM121 (DGX Spark) CUTLASS support (#33517)
Signed-off-by: code4me2 <velvetmoon222999@gmail.com>
* [Misc] Make `PlaceholderRange.get_num_embeds` a method (#34035)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [ROCm][CI] Pinning lm-eval version to resolve multi-modal small eval bug (#34038)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* Fix spelling errors (#33978)
* [Misc] Simplify `get_max_tokens` (#34036)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [CI][Build] Pin grpcio-tools==1.78.0 (#34048)
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
* [Renderer] Define `render_cmpl` and `render_chat` (#34039)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Kernel] Add KernelConfig flag to enable/disable FlashInfer autotune (#34006)
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
* [torch.compile] Stop compiling identical artifacts (#34003)
Signed-off-by: Richard Zou <zou3519@gmail.com>
* Enable Eagle3 speculative decoding for Mistral3ForConditionalGeneration to support eagle3 (#33939)
Signed-off-by: Akintunde Oladipo <akintunde.oladipo@servicenow.com>
Signed-off-by: TundeAtSN <akintunde.oladipo@servicenow.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [Frontend]Add support for transcriptions and translations to run_batch (#33934)
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
* [Model] Enable Step3p5ForCausalLM testing (#33755)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
* [PluggableLayer][3/N] Apply PluggableLayer to mamba layers. (#33660)
Signed-off-by: whx-sjtu <2952154980@qq.com>
* move checks out of `unified_kv_cache_update` custom op (#33943)
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
* Update DeepGEMM version pin in Dockerfile to match #32479 (#33935)
Signed-off-by: Zifei Tong <zifeitong@gmail.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
* Make directory exist ok for ray spinning up multiple replicas on a single instance (#33604)
Signed-off-by: Jiang Wu <jwu@cclgroup.com>
* Perf tuning and expansion of cases covered for wvSplitKrc (#33493)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
* [Doc] Fix run_batch docs (#34056)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [CI/Build] Skip GCS test (#34057)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [ROCm][Bugfix] fix act_quant_fusion module import error (#34069)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [Perf] Simplify DeepseekV32 tokenizer, ensure fast detokenization used (#33855)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* [ROCm] [CI] Reduce Resource of two test groups (#34059)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
* Add embedding input functionality for disabled modalities [remake] (#32493)
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com”>
Signed-off-by: Reagan Lee <reaganjlee@gmail.com>
Signed-off-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com”>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [Revert] Fix performance regression for GLM-4.7-GPTQ decode and MTP acceptance rate (#33771)
Signed-off-by: aabbccddwasd <aabbccddwasd@qq.com>
* [BugFix] Change support no act and mul for marlin (#34088)
Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
Co-authored-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
* [torch.compile] Add an option to force-enable the MOE cold start optimization (#33735)
Signed-off-by: Richard Zou <zou3519@gmail.com>
* glm 4.6 fused tuned inference config for B200 (#32958)
* Add support for ModelOpt MXFP8 dense models (#33786)
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
* [Release 2.10] Update to Torch 2.10 - final release (#30525)
* [bug-fix] supported_tasks is breaking backward compatibility at init_app_state (#34027)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
* [Tiny] Rename encoder budget file to more specific name (#34103)
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com”>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com”>
* [Frontend][last/5] Make pooling entrypoints request schema consensus. (#31127)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
* [BugFix] Fix `fastsafetensors` TP all procs using all GPUs (#34070)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
* fix(cpu): fix mla_decode compilation on x86 without AVX512 (#34052)
Signed-off-by: ihb2032 <hebome@foxmail.com>
Co-authored-by: root <root@LAPTOP-FKNHV411.localdomain>
* [Model] GLM adaptation (#34124)
* [CI] Remove empty image_size_factors for fuyu, glm4_1v, glm_ocr (#34107)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [ASR] Fix audio benchmark and add RTFx metric (#32300)
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
* [Fix] [CPU Backend] : Prepack weights for w8a8 oneDNN matmul (#33901)
Signed-off-by: nikhil-arm <nikhil.gupta2@arm.com>
* [XPU][6/N] add xpu scaled_mm kernel (#34117)
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
* [MODEL] Adding Support for Qwen3.5 Models (#34110)
Signed-off-by: JJJYmmm <1650675829@qq.com>
Signed-off-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: wulipc <wulipc@users.noreply.github.com>
Co-authored-by: ywang96 <ywang96@users.noreply.github.com>
Co-authored-by: Isotr0py <Isotr0py@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
* [Misc] Fix up attention benchmarks (#33810)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
* [UX] Add `--language-model-only` for hybrid models (#34120)
Signed-off-by: Roger Wang <hey@rogerw.io>
* [CI][torch.compile] Fix incorrect filtering for E2E fusion tests on B200 (#34031)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
* Add NUMA Core binding in nixl_connector for CPU xPyD (#32365)
Signed-off-by: Hongming Zheng <hongming.zheng@intel.com>
Signed-off-by: ZhengHongming888 <hongming.zheng@intel.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [Kernel] FlashInfer: switch allreduce fusion to unified API (#33985)
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
* [Bugfix] Fix shared expert input for latent MoE in EP+DP (Nemotron-H) (#34087)
Signed-off-by: Tomer Natan <tbarnatan@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* [Kernel] use flashinfer for gdn prefill (#32846)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
* [Bugfix] Avoid duplicate k-proj weight emission in helper (#34142)
Signed-off-by: Artus KG <artuskg@gmail.com>
* [Bugfix] Voxtral prompt/audio placeholder alignment (#34140)
Signed-off-by: Artus KG <artuskg@gmail.com>
* [ROCm] update triton branch to support gpt-oss models for gfx11xx devices (#34032)
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
* [torch.compile][Fusion] Fix attention fusion pass removing kv_udpate op. (#33945)
Signed-off-by: charlifu <charlifu@amd.com>
* [ModelRunner V2][BugFix] Fix `max_query_len` calculation (#34167)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* [Doc] Add DCP support to attention backend doc (#33936)
* [Bugfix][ROCm][GPT-OSS] Use old triton_kernels implementation on ROCm if the new API is not available (#34153)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
* [structured output] validate unsupported json features first (#33233)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
* [LMCache] Token Base IPC API (#34175)
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
* [Bugfix] Adopt `ChunkGatedDeltaRule` for Qwen3.5 (#34198)
Signed-off-by: Roger Wang <hey@rogerw.io>
* [ROCm][Bugfix] Resolve Dynamo tracing crash from amdsmi calls in on_gfx* arch detection (#34108)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [Bugfix][Core] Fix CPU memory leak from Request reference cycle in prefix caching (#34183)
Signed-off-by: Roger Wang <hey@rogerw.io>
* [Doc] Update usage of `--limit-mm-per-prompt` (#34148)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [CI/Build] Relax `test_mcp_tool_call` (#34204)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Bugfix] Fix DP Attention Padding in Dummy Run (#34187)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
* [Bugfix] Add `--trust-remote-code` to dataset bench args (#34208)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [responsesAPI] fix simpleContext streaming output_messages (#34188)
Signed-off-by: Andrew Xia <axia@meta.com>
Signed-off-by: Andrew Xia <axia@fb.com>
Co-authored-by: Andrew Xia <axia@fb.com>
* [Bugfix] Sort hf_weights_files in fastsafetensors_weights_iterator to match #33491 (#34190)
Signed-off-by: Balaxxe <136368465+jaim12005@users.noreply.github.com>
* [Frontend][CI] Consolidate instrumentator entrypoints (#34123)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
* [BugFix] Avoid prefix cache hit in the same schedule step for mamba layers (#29387)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
* [Perf] Optimize detokenizer python logic (#32975)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
* Revert #34208 (#34216)
* [Bugfix] Fix memory inconsistency in cross-process shared memory (#32022)
Signed-off-by: Zetong Li <slippersss@126.com>
* [Bugfix] Fix `--trust-remote-code` conflict (#34218)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Docs] Fix format error in KV load failure recovery doc (#34137)
Signed-off-by: Jaebok Lee <jaebok9541@naver.com>
* [Bugfix] Fix FI kernel`chunk_gated_delta_rule` output shape for Qwen3.5 (#34219)
Signed-off-by: Roger Wang <hey@rogerw.io>
* Add flagos in MiniCPM-o (#34126)
Signed-off-by: tc-mb <caitianchi@modelbest.cn>
Signed-off-by: Vincent-Xiao <vincent.xiao.me@gmail.com>
Co-authored-by: Vincent-Xiao <vincent.xiao.me@gmail.com>
* [Misc] allow specify is_mm_prefix_lm in hf_config (#34215)
* Stop testing for slow tokenizers as they will not exist soon (#34235)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [V1][BugFix] Fix EAGLE3 encoder cache miss with disable_chunked_mm_input (#34220)
Signed-off-by: KrxGu <krishom70@gmail.com>
* Bump `mamba-ssm` version in CI for Transformers v5 compatibility (#34233)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* add --insecure arg to the vllm bench to skip TLS (#34026)
Signed-off-by: Fan Yang <yan9fan@meta.com>
Co-authored-by: Fan Yang <yan9fan@meta.com>
* Support benchmarking of Geospatial models (#33922)
Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com>
* [ROCm][Quantization] GPT_OSS in amd-quark format model loading and emulations (#29008)
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
* [compile] Enable AOT compile with 2.10 in trunk. (#34155)
Signed-off-by: Zhengxu Chen <zhxchen17@meta.com>
* [Perf][Kernel] Add faster topKperRow decode kernel for DeepSeek-V3.2 sparse attention (#33680)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* [Core][BugFix] Fix PP KV cache sharding memory validation (#33698)
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
* [BUGFIX] Fix accuracy bugs in Qwen3-Next MTP (#34077)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
* [Docs] Speed up build environment set-up (#34240)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Model Runner V2] Use pinned memory for write_contents (#34222)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
* Minor cleanup for Voxtral (#34247)
Signed-off-by: Andy Lo <andy@mistral.ai>
* [UX nit] Fix non-default api_server_count message (#34152)
Signed-off-by: mgoin <mgoin64@gmail.com>
* [Misc] Introduce ec_both role EC (encoder cache) connector (#34182)
Signed-off-by: Qi Wang <qiwa@nvidia.com>
* Convert online APIs to use Renderer (#34084)
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com”>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com”>
* [Bugfix] Fix weights offloading for sleep mode (#32947)
Signed-off-by: Jarno Seppänen <jseppanen@nvidia.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
* [Benchmarks] Fix attention benchmark smoke test (#34269)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
* [Bugfix] Fix mamba cache dtype for Qwen3.5 (#34200)
Signed-off-by: Roger Wang <hey@rogerw.io>
* [SM100] Resubmit FMHA FP8 prefill for MLA (#31195)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
* [Feature] Warn about unrecognized environment variables (#33581)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
* [Perf] Move eplb rebalance algo to async thread (#30888)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
* [BugFix] Fix async EPLB hang with DeepEP LL all2all backend (#32860)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
* [Misc][Spec Decode] support different load config for draft model (#34022)
Signed-off-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com>
Co-authored-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com>
* [torch.compile] Disable recursive pre_grad_passes (#34092)
Signed-off-by: Richard Zou <zou3519@gmail.com>
* [Misc] Add pre-commit hook to catch boolean ops in with-statements (#34271)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* [CI] Add pip caching to cleanup_pr_body workflow (#32979)
Signed-off-by: 7. Sun <jhao.sun@gmail.com>
* [MoE Refactor] Introduce MoERunner abstraction and move execution logic from FusedMoE to DefaultMoERunner (#32344)
Signed-off-by: Bill Nell <bnell@redhat.com>
* [ROCm][CI] Fix test_sequence_parallel.py location in AMD CI pipeline (#34280)
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
* [Misc] Add run one batch script that supports profiling (#32968)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
* [Bugfix] Fix Worker.load_model context-manager composition for sleep mode (#34021)
Signed-off-by: tianshu.yu <tianshuyu.formal@gmail.com>
* [Redo] Add `--trust-remote-code` to dataset bench args (#34251)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [torch.compile] Stop doing unnecessary FakeTensorProp in PiecewiseCompileInterpreter (#34093)
Signed-off-by: Richard Zou <zou3519@gmail.com>
* [WideEP] Fix nvfp4 DeepEP High Throughput All2All backend (#33738)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
* [Misc] Clean up validation logic in input processor (#34144)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Bugfix][DeepSeek-V3.2] fix fp8 kvcache type cast (#33884)
Signed-off-by: Kebe <mail@kebe7jun.com>
* [Kernel] Apply 256bit LDG/STG To Activation Kernels (#33022)
Signed-off-by: Dzerzhinsky <256908701+AstroVoyager7@users.noreply.github.com>
Signed-off-by: Дзержи́нский <256908701+AstroVoyager7@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
* [XPU][7/N] enable xpu fp8 moe (#34202)
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
* [Plugin] Simplify IO Processor Plugin interface (#34236)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Bugfix] Fix benchmark_moe.py inplace assertion with torch >= 2.9 (#34149)
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
* Threshold fix wvSplitk for occasional CI fails (#34013)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
* [ModelBash][DSR1 NVFp4] Removed Bf16 Bias Cast (#34298)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
* [Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides (#34279)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* [Bugfix] Fix weight naming in Qwen3.5 (#34313)
Signed-off-by: Roger Wang <hey@rogerw.io>
* [CPU] Enable FP16 (Half dtype) support for s390x (#34116)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
* [model] support FunASR model (#33247)
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
* [XPU][9/N] clean up existing ipex code/doc (#34111)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
* [Chore] Move `BaseRenderer` to `base.py` (#34308)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [torch.compile] Enable AR+rms fusion by default available for `-O2` (#34299)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
* [Misc] Bump `fastsafetensors` version for latest fixes (#34273)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* [Doc] Update Marlin support matrix for Turing (#34319)
Signed-off-by: Tianqi Ren <tianqi.r@outlook.com>
* [Frontend] Exploit tokenizers "new stream" in FastIncrementalDetokenizer (#34217)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* Patch protobuf for CVE-2026-0994 (#34253)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
* [Docs] Reduce time spent generating API docs (#34255)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Bugfix][CPU] Fix llama4 inference on CPU (#34321)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
* Make Qwen3VL compatible with Transformers v5 (#34262)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
* Make JAIS compatible with Transformers v5 (#34264)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [NVIDIA][test] Tests for flashinfer TRTLLM BF16 MoE (#33715)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
* Responses harmony system message structured (#34268)
Signed-off-by: Adam Binford <adamq43@gmail.com>
* Reapply [Attention][FA3] Update FA3 to include new swizzle optimization (#34043)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
* Don't try and run GLM-ASR with remote code (#34352)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Bugfix]: Fix ROCm fusion attn test; use AttentionBackend utils to create kv cache (#33948)
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
* [ROCm] [aiter] Split KV cache update for AiterFlashAttention (#33681)
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
* [Docs] Fix typo ("defult") and double spacing (#34348)
Signed-off-by: SorenDreano <71752785+SorenDreano@users.noreply.github.com>
Co-authored-by: Soren Dreano <soren@numind.ai>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [CI][BugFix] Fix silent failure in shellcheck hook and baseline exist… (#32458)
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
* [Model Runner V2] Init cuda graph pool when necessary (#33217)
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
* [Multimodal] Expose `mm_processor_kwargs` for `DummyInputsBuilder` (#34330)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
* [Bugfix] fix default is_neox_style is True for deepseek (#34353)
Signed-off-by: dongxinyu03 <dongxinyu03@baidu.com>
* [Bugfix] Enable attn quantization of Llama-4 by correctly permuting scales for rope (int8, fp8) (#34243)
Signed-off-by: Your Name <you@example.com>
Co-authored-by: Your Name <you@example.com>
* [ROCm] [CI] fix test_unrecognized_env (#34350)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
* [GPT-OSS] Remove unnecessary contiguous (#34337)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
* Add cartridge (prefix) benchmark configs to CI workflows
The prefix_latency and prefix_throughput configs existed but weren't
being run by any workflow. Each benchmark workflow now runs both the
base and cartridge configs using the shared server support.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* update flashinfer
* update wheel
* update cuda and flashinfer
* downgrade
* update tests
---------
Signed-off-by: Taeksang Kim <ts.kim@hyperaccel.ai>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: Muhammad Hashmi <mhashmi@berkeley.edu>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Richard Zou <zou3519@gmail.com>
Signed-off-by: Sage Moore <sagmoore@redhat.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Ilya Boytsov <ilyaboytsov1805@gmail.com>
Signed-off-by: Ilya Boytsov <boytsovpanamera@mail.ru>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: ProExpertProg <luka.govedic@gmail.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: rinbaro <ilgomishra@gmail.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Signed-off-by: Andrew Xia <axia@fb.com>
Signed-off-by: Andrew Xia <axia@meta.com>
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
Signed-off-by: Liran Schour <lirans@il.ibm.com>
Signed-off-by: liranschour <liranschour@users.noreply.github.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: mariohong <mariohong128@gmail.com>
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: Aaron Hao <ahao@anyscale.com>
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Lihao Ran <imlihao.ran@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: rabi <ramishra@redhat.com>
Signed-off-by: limingliang <limingliang@stepfun.com>
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: sihao.li <sihao.li@intel.com>
Signed-off-by: Chengcheng Pei <chengchengpei@outlook.com>
Signed-off-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com>
Signed-off-by: Gassan <gassan.salama@arm.com>
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
Signed-off-by: zhangyue66 <zhangyue66@baidu.com>
Signed-off-by: kurt <kurt@thinkingmachines.ai>
Signed-off-by: raushan <raushan@huggingface.co>
Signed-off-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz>
Signed-off-by: Frederic Odermatt <frederic.odermatt@44ai.ch>
Signed-off-by: caitianchi <caitianchi@modelbest.cn>
Signed-off-by: tc-mb <caitianchi@modelbest.cn>
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
Signed-off-by: Eldar Kurtić <8884008+eldarkurtic@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Ikenna <ikennachifo@gmail.com>
Signed-off-by: Xin Yang <xyangx@amazon.com>
Signed-off-by: code4me2 <velvetmoon222999@gmail.com>
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Signed-off-by: Akintunde Oladipo <akintunde.oladipo@servicenow.com>
Signed-off-by: TundeAtSN <akintunde.oladipo@servicenow.com>
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
Signed-off-by: Zifei Tong <zifeitong@gmail.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: Jiang Wu <jwu@cclgroup.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com”>
Signed-off-by: Reagan Lee <reaganjlee@gmail.com>
Signed-off-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com>
Signed-off-by: aabbccddwasd <aabbccddwasd@qq.com>
Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
Signed-off-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Signed-off-by: ihb2032 <hebome@foxmail.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: nikhil-arm <nikhil.gupta2@arm.com>
Signed-off-by: JJJYmmm <1650675829@qq.com>
Signed-off-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Hongming Zheng <hongming.zheng@intel.com>
Signed-off-by: ZhengHongming888 <hongming.zheng@intel.com>
Signed-off-by: Tomer Natan <tbarnatan@nvidia.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: Artus KG <artuskg@gmail.com>
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Andy Xie <andy.xning@gmail.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Signed-off-by: Balaxxe <136368465+jaim12005@users.noreply.github.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Zetong Li <slippersss@126.com>
Signed-off-by: Jaebok Lee <jaebok9541@naver.com>
Signed-off-by: Vincent-Xiao <vincent.xiao.me@gmail.com>
Signed-off-by: KrxGu <krishom70@gmail.com>
Signed-off-by: Fan Yang <yan9fan@meta.com>
Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Zhengxu Chen <zhxchen17@meta.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Signed-off-by: Andy Lo <andy@mistral.ai>
Signed-off-by: Qi Wang <qiwa@nvidia.com>
Signed-off-by: Jarno Seppänen <jseppanen@nvidia.com>
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com>
Signed-off-by: 7. Sun <jhao.sun@gmail.com>
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Signed-off-by: tianshu.yu <tianshuyu.formal@gmail.com>
Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: Dzerzhinsky <256908701+AstroVoyager7@users.noreply.github.com>
Signed-off-by: Дзержи́нский <256908701+AstroVoyager7@users.noreply.github.com>
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: Tianqi Ren <tianqi.r@outlook.com>
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
Signed-off-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Signed-off-by: SorenDreano <71752785+SorenDreano@users.noreply.github.com>
Signed-off-by: dongxinyu03 <dongxinyu03@baidu.com>
Signed-off-by: Your Name <you@example.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Taeksang Kim <voidbag@gmail.com>
Co-authored-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com>
Co-authored-by: Muhammad Hashmi <105992724+mu-hashmi@users.noreply.github.com>
Co-authored-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Richard Zou <zou3519@users.noreply.github.com>
Co-authored-by: Sage Moore <sagmoore@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Ilya Boytsov <boytsovpanamera@mail.ru>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: zhanqiuhu <49648934+ZhanqiuHu@users.noreply.github.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
Co-authored-by: Andreas Karatzas <akaratza@amd.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: rinbaro <ilgomishra@gmail.com>
Co-authored-by: rasmith <Randall.Smith@amd.com>
Co-authored-by: Andrew Xia <axia@meta.com>
Co-authored-by: Andrew Xia <axia@fb.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
Co-authored-by: Fadi Arafeh <115173828+fadara01@users.noreply.github.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: liranschour <liranschour@users.noreply.github.com>
Co-authored-by: Or Ozeri <or@ozery.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Mario Hong <86880754+mariohong128@users.noreply.github.com>
Co-authored-by: Aaron Hao <ahao@anyscale.com>
Co-authored-by: SumanthRH <sumanthrh99@gmail.com>
Co-authored-by: danisereb <daserebrenik@nvidia.com>
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
Co-authored-by: zackyoray <yorayz@nvidia.com>
Co-authored-by: bnellnm <49004751+bnellnm@users.noreply.github.com>
Co-authored-by: Tsukasa OI <floss_llm@irq.a4lg.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Lumosis <30372757+Lumosis@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Hashem Hashemi <159079214+amd-hhashemi@users.noreply.github.com>
Co-authored-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
Co-authored-by: emricksini-h <emrick.birivoutin@hcompany.ai>
Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Rabi Mishra <ramishra@redhat.com>
Co-authored-by: Mingliang Li <limingliang0527@gmail.com>
Co-authored-by: limingliang <limingliang@stepfun.com>
Co-authored-by: R3hankhan <Rehan.Khan7@ibm.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: sihao_li <165983188+1643661061leo@users.noreply.github.com>
Co-authored-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Gassan Salama <gassan.salama@arm.com>
Co-authored-by: Xinyu Chen <xinyu1.chen@intel.com>
Co-authored-by: zhang-prog <69562787+zhang-prog@users.noreply.github.com>
Co-authored-by: Kurt Shuster <shuster.kurt@gmail.com>
Co-authored-by: SorenDreano <71752785+SorenDreano@users.noreply.github.com>
Co-authored-by: Soren Dreano <soren@numind.ai>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
Co-authored-by: FredericOdermatt <50372080+FredericOdermatt@users.noreply.github.com>
Co-authored-by: tc-mb <157115220+tc-mb@users.noreply.github.com>
Co-authored-by: mslv <mslv@baai.ac.cn>
Co-authored-by: zofia <110436990+zufangzhu@users.noreply.github.com>
Co-authored-by: Eldar Kurtić <8884008+eldarkurtic@users.noreply.github.com>
Co-authored-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: zhrrr <43847754+izhuhaoran@users.noreply.github.com>
Co-authored-by: Charlie Fu <charlifu@amd.com>
Co-authored-by: xuebwang-amd <xuebwang@amd.com>
Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>
Co-authored-by: Dimitrios Bariamis <dbari@users.noreply.github.com>
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Ikenna <ikennachifo@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: 果冻虾仁 <guodong@apache.org>
Co-authored-by: Vel <110626982+Code4me2@users.noreply.github.com>
Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com>
Co-authored-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Co-authored-by: TundeAtSN <akintunde.oladipo@servicenow.com>
Co-authored-by: Pooya Davoodi <pooya.davoodi@parasail.io>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com>
Co-authored-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com>
Co-authored-by: zifeitong <zifeitong@gmail.com>
Co-authored-by: Jiang Wu <jwu@cclgroup.com>
Co-authored-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com”>
Co-authored-by: aabbccddwasd <140953076+aabbccddwasd@users.noreply.github.com>
Co-authored-by: TomerBN-Nvidia <tbarnatan@nvidia.com>
Co-authored-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
Co-authored-by: navmarri14 <nmarri@roblox.com>
Co-authored-by: Andrey Talman <atalman@fb.com>
Co-authored-by: ihb2032 <40718643+ihb2032@users.noreply.github.com>
Co-authored-by: root <root@LAPTOP-FKNHV411.localdomain>
Co-authored-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Nikhil Gupta <nikhil.gupta2@arm.com>
Co-authored-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com>
Co-authored-by: wulipc <wulipc@users.noreply.github.com>
Co-authored-by: ywang96 <ywang96@users.noreply.github.com>
Co-authored-by: Isotr0py <Isotr0py@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: ZhengHongming888 <hongming.zheng@intel.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: Artus Krohn-Grimberghe <artuskg@users.noreply.github.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: Ning Xie <andy.xning@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Yuwei An <ayw.sirius19@gmail.com>
Co-authored-by: Balaxxe <136368465+jaim12005@users.noreply.github.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Zetong Li <48438720+slippersss@users.noreply.github.com>
Co-authored-by: zzaebok <44357534+zzaebok@users.noreply.github.com>
Co-authored-by: Vincent-Xiao <vincent.xiao.me@gmail.com>
Co-authored-by: Phúc H. Lê Khắc <lkhphuc@pm.me>
Co-authored-by: Krish Gupta <krishom70@gmail.com>
Co-authored-by: Fan Yang <fanyang.real@gmail.com>
Co-authored-by: Fan Yang <yan9fan@meta.com>
Co-authored-by: mgazz <michele.gazzetti1@ibm.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Zhengxu Chen <zhxchen17@meta.com>
Co-authored-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Andy Lo <andy@mistral.ai>
Co-authored-by: Qi Wang <wqstu1@gmail.com>
Co-authored-by: J Seppänen <83203+jseppanen@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Ilya Markov <markovilya197@gmail.com>
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Zhengkai Zhang <33679250+ZhengkaiZ@users.noreply.github.com>
Co-authored-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com>
Co-authored-by: 7. Sun <jhao.sun@gmail.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
Co-authored-by: tianshu-Michael-yu <101950379+tianshu-Michael-yu@users.noreply.github.com>
Co-authored-by: Kebe <mail@kebe7jun.com>
Co-authored-by: Дзержи́нский <256908701+AstroVoyager7@users.noreply.github.com>
Co-authored-by: Matthias Gehre <matthias.gehre@amd.com>
Co-authored-by: AllenDou <allen.dou@hotmail.com>
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: Tianqi Ren <tianqi.r@outlook.com>
Co-authored-by: Linda <57756729+Linda-Stadter@users.noreply.github.com>
Co-authored-by: Adam Binford <adamq43@gmail.com>
Co-authored-by: kliuae <17350011+kliuae@users.noreply.github.com>
Co-authored-by: Xinyu Dong <dongxinyu03@baidu.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
* [Bugfix] fix DeepSeek R1 with CUTLASS MLA Broken on B200 (#33637)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
* [release] Minor fixes to release annotation (#33849)
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
* [CI][Bugfix]: return McpCall for built-in MCP tools in non-streaming mode (#32762)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* Revert "[Attention][FA3] Update FA3 to include new swizzle optimization" (#33841)
* [Minor] Include `StreamingInput` in inputs package (#33856)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* [docs] fix unintentional misspellings (#33863)
Signed-off-by: rinbaro <ilgomishra@gmail.com>
* [CI][AMD][BugFix] Ensure VLLM_ROCM_USE_AITER is set so test_rocm_aiter_topk.py can run correctly (#33840)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
* [2/N] move responses/serving _make_response_output_items logic to parser (#33281)
Signed-off-by: Andrew Xia <axia@fb.com>
Signed-off-by: Andrew Xia <axia@meta.com>
Co-authored-by: Andrew Xia <axia@fb.com>
* [CI/Build] Parallelize CPU CI tests (#33778)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
* [Bugfix] Fix ScoreMultiModalParam multi-document scoring returning single result (#33837)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
* [CPU][BugFix] Allow w8a8 oneDNN quantized matmul to support 3D inputs (#33727)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
* [CI/Build] Fix CPU CI test case title (#33870)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
* [Perf] Optimize the performance of structured output + reasoning (#33557)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
* [KV Connector][Metrics] Do not count local prefix cache hits in connector queries (#30522)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
* [Bugfix] Kimi-K2 grouped_topk usage for Flashinfer monolithic kernels. (#33858)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
* [Refactor] Move `task` outside of `PoolingParams.verify` (#33796)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
* [ROCm][Bugfix][CI] Fix hybrid models and their tests (Mamba/Jamba/Bamba) (#32710)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
* Enable Cross layers KV cache layout at NIXL Connector V2 (#33339)
Signed-off-by: Liran Schour <lirans@il.ibm.com>
Signed-off-by: liranschour <liranschour@users.noreply.github.com>
Co-authored-by: Or Ozeri <or@ozery.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
* [perf] Integrate flashinfer concat_mla_k (#31171)
* [Bugfix] Fix Kimi-K2.5 NVFP4 checkpoints weight loading (#33876)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
* [Refactor] Clean up input preprocessing (#33687)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Bugfix] Fix corner case of sparse embedding (#33886)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
* [Docs] Add bart-plugin to docs (#33905)
Signed-off-by: NickLucche <nlucches@redhat.com>
* [Bugfix] Fix step3p5 parser when using mtp (#33690)
Signed-off-by: mariohong <mariohong128@gmail.com>
* [Feat][RL][1/2] Native Weight Syncing API: NCCL (#31943)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: Aaron Hao <ahao@anyscale.com>
Co-authored-by: SumanthRH <sumanthrh99@gmail.com>
* [BugFix] Fix LoRA Fp8 (#33879)
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
* [Spec Decode] Unified Parallel Drafting (#32887)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
* [Misc] Add debug logs (#33931)
Signed-off-by: NickLucche <nlucches@redhat.com>
* [Bugfix] Fix swapped engine_ids in NIXL Llama 4 local attention path (#33795)
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
* [Moe Refactor] Make Inplace Flag for FusedMoEModularKernel part of the constructor (#33375)
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
* [Models] Consolidate Deepseek-OCR2 processor (#33909)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
* [Bugfix] Suppress non-TTY color output on the process name part of the log (#29714)
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>
* Fix tokenizer test for renamed attr on Transformers v5 (#33902)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Misc] Rename `translations` to `speech_to_text` for OAI serving component (#33904)
Signed-off-by: NickLucche <nlucches@redhat.com>
* [Bugfix] Fix DSV3.2 NVFP4 (#33932)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
* [Bugfix] Make MM batching more robust (#33817)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Minor] Sort safetensors files to ensure deterministic loading order (#33491)
Signed-off-by: Lihao Ran <imlihao.ran@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
* Adds padding and perf improvements to wvSplitK_fp8 (#33527)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
* [Bugfix] Fix DeepSeek v3.2 tokenizer outputting None issue (#33832)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
* [Feature] OTEL tracing during loading (#31162)
* [Perf] Disable clean_logits in deepgemm fp8_mqa_logits kernel (#33568)
* [Docs] Add reo analytics (#33957)
Signed-off-by: simon-mo <simon.mo@hey.com>
* fix(ROCm): Make flash_attn import optional in MLA attention (#33511)
Signed-off-by: rabi <ramishra@redhat.com>
* feat(frontend): early-fail tokenization guard for user requests (#31366)
Signed-off-by: limingliang <limingliang@stepfun.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: limingliang <limingliang@stepfun.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Misc] Update code for encoder-decoder models (#33900)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [CPU] Add BF16 Kernel type for s390x (#33788)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
* [XPU][4/N] add mxfp4 moe model support (#33679)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
* [XPU]Replace pip in docker.xpu with uv pip (#31112)
Signed-off-by: sihao.li <sihao.li@intel.com>
* Onboard voyage-4-nano (#33720)
Signed-off-by: Chengcheng Pei <chengchengpei@outlook.com>
Signed-off-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com>
Co-authored-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [cpu][performance] CPU Paged Attention NEON BFMMLA BF16 Implementation (#32263)
Signed-off-by: Gassan <gassan.salama@arm.com>
* Fix `main` pre-commit (#33975)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* support view_from_cpu_tensor on XPU (#33868)
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
* Consolidate and fix forbidden import `pre-commit` checks (#33982)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [PaddleOCR-VL] Add BC for transformers 5.0 config (#33976)
Signed-off-by: zhangyue66 <zhangyue66@baidu.com>
* Bump HF Hub client to get bug fix (#33984)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [CPU][BugFix] Fix loading of w8a8int models with bias (#33582)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
* [torch.compile] Reorganize vllm/compilation and tests/compile (0/N for vLLM IR) (#33731)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: ProExpertProg <luka.govedic@gmail.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
* [Bugfix][Model] Support LoRA on Qwen3 Output Embedding (#29816)
Signed-off-by: kurt <kurt@thinkingmachines.ai>
* [Docs] Improve documentation (#33799)
Co-authored-by: Soren Dreano <soren@numind.ai>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
* Update `WeightTransferConfig` to be more standard like the others (#33989)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Bugfix] Fix models and tests for transformers v5 (#33977)
Signed-off-by: raushan <raushan@huggingface.co>
Signed-off-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [FIX] guidance: use max(vocab_size, len(tokenizer)) for n_vocab (#33509)
Signed-off-by: Frederic Odermatt <frederic.odermatt@44ai.ch>
* [ROCm][AITER] Fix AITER import regression for explicit backend selection (#33749)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [Docs] Add sections on process architecture and minimum CPU resources (#33940)
Users can be confused about vLLM's performance when running
with very few CPU cores available. We were missing a clear
overview of vLLM's process architecture, so I added one, along with
some diagrams, in arch_overview.md, and included a section on CPU resource
recommendations in optimization.md.
Signed-off-by: mgoin <mgoin64@gmail.com>
* [Model] Support MiniCPM-o 4.5 (#33431)
Signed-off-by: caitianchi <caitianchi@modelbest.cn>
Signed-off-by: tc-mb <caitianchi@modelbest.cn>
Co-authored-by: mslv <mslv@baai.ac.cn>
* [Refactor] Consolidate sequence normalization and enc-dec parsing (#33928)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [XPU][5/N] add wna16 xpu kernel (#33973)
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
* [Docs] Update link to Benchmark CLI documentation (#33254)
Signed-off-by: Eldar Kurtić <8884008+eldarkurtic@users.noreply.github.com>
* [Bugfix] Fix the issue where tool calling does not work when using fast detokenization with dsv32 (#33964)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
* [Log] Optimize duplicate startup log (#33944)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
* [KV Connector] Add missing method overrides to MultiConnector (#33292)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
* [DOC] [ROCm] Update docker deployment doc (#33971)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Model Runner V2] support apply penalty for spec decode (#33251)
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
* [Refactor] Remove align block size logic in `moe_permute` (#33449)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
* [Rocm][Bugfix] Fix dtype not same for gemm_a4w4 op (#33734)
Signed-off-by: charlifu <charlifu@amd.com>
* [Bugfix] Fix no attribute error of SharedFusedMoE (DeepSeek-V3.1 as test model) (#33993)
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
* [Fix] Fix `logprobs=0` handling for `/inference/v1/generate` endpoint (#34010)
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
* Fix RoutingMethodType logic (#33919)
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
* [bugfix] [ROCm] Fix premature CUDA initialization in platform detection (#33941)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [Feat][RL] Pause and Resume with keep requests for single engine (#32351)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: Aaron Hao <ahao@anyscale.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
* [Bugfix] Fix QK Norm+RoPE fusion pattern matching on B200+FP8 (#33967)
Signed-off-by: Ikenna <ikennachifo@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
* [Bugfix] Fix Whisper tokenization (#34011)
Signed-off-by: NickLucche <nlucches@redhat.com>
* [CI][AMD][Bugfix] Check that model_config is not None in enable_norm_pad_fusion (#34007)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
* [Bugfix] Fix _fused_moe_lora_expand signature mismatch (#33821)
Signed-off-by: Xin Yang <xyangx@amazon.com>
* [Misc] Add backward-compatible import aliases for renamed translations module (#34015)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* [ModelRunner V2] Revert token rank comparison difference for now (#34017)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* fix description in plugin_system.md (#33999)
* [Revert] Add util `handle_deprecated` back (#33998)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
* [Kernel] Add enable_sm120_or_later for SM121 (DGX Spark) CUTLASS support (#33517)
Signed-off-by: code4me2 <velvetmoon222999@gmail.com>
* [Misc] Make `PlaceholderRange.get_num_embeds` a method (#34035)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [ROCm][CI] Pinning lm-eval version to resolve multi-modal small eval bug (#34038)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* Fix spelling errors (#33978)
* [Misc] Simplify `get_max_tokens` (#34036)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [CI][Build] Pin grpcio-tools==1.78.0 (#34048)
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
* [Renderer] Define `render_cmpl` and `render_chat` (#34039)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Kernel] Add KernelConfig flag to enable/disable FlashInfer autotune (#34006)
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
* [torch.compile] Stop compiling identical artifacts (#34003)
Signed-off-by: Richard Zou <zou3519@gmail.com>
* Enable Eagle3 speculative decoding for Mistral3ForConditionalGeneration to support eagle3 (#33939)
Signed-off-by: Akintunde Oladipo <akintunde.oladipo@servicenow.com>
Signed-off-by: TundeAtSN <akintunde.oladipo@servicenow.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [Frontend] Add support for transcriptions and translations to run_batch (#33934)
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
* [Model] Enable Step3p5ForCausalLM testing (#33755)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
* [PluggableLayer][3/N] Apply PluggableLayer to mamba layers. (#33660)
Signed-off-by: whx-sjtu <2952154980@qq.com>
* move checks out of `unified_kv_cache_update` custom op (#33943)
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
* Update DeepGEMM version pin in Dockerfile to match #32479 (#33935)
Signed-off-by: Zifei Tong <zifeitong@gmail.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
* Make directory exist ok for ray spinning up multiple replicas on a single instance (#33604)
Signed-off-by: Jiang Wu <jwu@cclgroup.com>
* Perf tuning and expansion of cases covered for wvSplitKrc (#33493)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
* [Doc] Fix run_batch docs (#34056)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [CI/Build] Skip GCS test (#34057)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [ROCm][Bugfix] fix act_quant_fusion module import error (#34069)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [Perf] Simplify DeepseekV32 tokenizer, ensure fast detokenization used (#33855)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* [ROCm] [CI] Reduce Resource of two test groups (#34059)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
* Add embedding input functionality for disabled modalities [remake] (#32493)
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com”>
Signed-off-by: Reagan Lee <reaganjlee@gmail.com>
Signed-off-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com”>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [Revert] Fix performance regression for GLM-4.7-GPTQ decode and MTP acceptance rate (#33771)
Signed-off-by: aabbccddwasd <aabbccddwasd@qq.com>
* [BugFix] Change support no act and mul for marlin (#34088)
Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
Co-authored-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
* [torch.compile] Add an option to force-enable the MOE cold start optimization (#33735)
Signed-off-by: Richard Zou <zou3519@gmail.com>
* glm 4.6 fused tuned inference config for B200 (#32958)
* Add support for ModelOpt MXFP8 dense models (#33786)
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
* [Release 2.10] Update to Torch 2.10 - final release (#30525)
* [bug-fix] supported_tasks is breaking backward compatibility at init_app_state (#34027)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
* [Tiny] Rename encoder budget file to more specific name (#34103)
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com”>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com”>
* [Frontend][last/5] Make pooling entrypoints request schema consensus. (#31127)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
* [BugFix] Fix `fastsafetensors` TP all procs using all GPUs (#34070)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
* fix(cpu): fix mla_decode compilation on x86 without AVX512 (#34052)
Signed-off-by: ihb2032 <hebome@foxmail.com>
Co-authored-by: root <root@LAPTOP-FKNHV411.localdomain>
* [Model] GLM adaptation (#34124)
* [CI] Remove empty image_size_factors for fuyu, glm4_1v, glm_ocr (#34107)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [ASR] Fix audio benchmark and add RTFx metric (#32300)
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
* [Fix] [CPU Backend] : Prepack weights for w8a8 oneDNN matmul (#33901)
Signed-off-by: nikhil-arm <nikhil.gupta2@arm.com>
* [XPU][6/N] add xpu scaled_mm kernel (#34117)
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
* [MODEL] Adding Support for Qwen3.5 Models (#34110)
Signed-off-by: JJJYmmm <1650675829@qq.com>
Signed-off-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: wulipc <wulipc@users.noreply.github.com>
Co-authored-by: ywang96 <ywang96@users.noreply.github.com>
Co-authored-by: Isotr0py <Isotr0py@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
* [Misc] Fix up attention benchmarks (#33810)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
* [UX] Add `--language-model-only` for hybrid models (#34120)
Signed-off-by: Roger Wang <hey@rogerw.io>
* [CI][torch.compile] Fix incorrect filtering for E2E fusion tests on B200 (#34031)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
* Add NUMA Core binding in nixl_connector for CPU xPyD (#32365)
Signed-off-by: Hongming Zheng <hongming.zheng@intel.com>
Signed-off-by: ZhengHongming888 <hongming.zheng@intel.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [Kernel] FlashInfer: switch allreduce fusion to unified API (#33985)
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
* [Bugfix] Fix shared expert input for latent MoE in EP+DP (Nemotron-H) (#34087)
Signed-off-by: Tomer Natan <tbarnatan@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* [Kernel] use flashinfer for gdn prefill (#32846)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
* [Bugfix] Avoid duplicate k-proj weight emission in helper (#34142)
Signed-off-by: Artus KG <artuskg@gmail.com>
* [Bugfix] Voxtral prompt/audio placeholder alignment (#34140)
Signed-off-by: Artus KG <artuskg@gmail.com>
* [ROCm] update triton branch to support gpt-oss models for gfx11xx devices (#34032)
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
* [torch.compile][Fusion] Fix attention fusion pass removing kv_update op. (#33945)
Signed-off-by: charlifu <charlifu@amd.com>
* [ModelRunner V2][BugFix] Fix `max_query_len` calculation (#34167)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* [Doc] Add DCP support to attention backend doc (#33936)
* [Bugfix][ROCm][GPT-OSS] Use old triton_kernels implementation on ROCm if the new API is not available (#34153)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
* [structured output] validate unsupported json features first (#33233)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
* [LMCache] Token Base IPC API (#34175)
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
* [Bugfix] Adopt `ChunkGatedDeltaRule` for Qwen3.5 (#34198)
Signed-off-by: Roger Wang <hey@rogerw.io>
* [ROCm][Bugfix] Resolve Dynamo tracing crash from amdsmi calls in on_gfx* arch detection (#34108)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [Bugfix][Core] Fix CPU memory leak from Request reference cycle in prefix caching (#34183)
Signed-off-by: Roger Wang <hey@rogerw.io>
* [Doc] Update usage of `--limit-mm-per-prompt` (#34148)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [CI/Build] Relax `test_mcp_tool_call` (#34204)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Bugfix] Fix DP Attention Padding in Dummy Run (#34187)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
* [Bugfix] Add `--trust-remote-code` to dataset bench args (#34208)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [responsesAPI] fix simpleContext streaming output_messages (#34188)
Signed-off-by: Andrew Xia <axia@meta.com>
Signed-off-by: Andrew Xia <axia@fb.com>
Co-authored-by: Andrew Xia <axia@fb.com>
* [Bugfix] Sort hf_weights_files in fastsafetensors_weights_iterator to match #33491 (#34190)
Signed-off-by: Balaxxe <136368465+jaim12005@users.noreply.github.com>
* [Frontend][CI] Consolidate instrumentator entrypoints (#34123)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
* [BugFix] Avoid prefix cache hit in the same schedule step for mamba layers (#29387)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
* [Perf] Optimize detokenizer python logic (#32975)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
* Revert #34208 (#34216)
* [Bugfix] Fix memory inconsistency in cross-process shared memory (#32022)
Signed-off-by: Zetong Li <slippersss@126.com>
* [Bugfix] Fix `--trust-remote-code` conflict (#34218)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Docs] Fix format error in KV load failure recovery doc (#34137)
Signed-off-by: Jaebok Lee <jaebok9541@naver.com>
* [Bugfix] Fix FI kernel `chunk_gated_delta_rule` output shape for Qwen3.5 (#34219)
Signed-off-by: Roger Wang <hey@rogerw.io>
* Add flagos in MiniCPM-o (#34126)
Signed-off-by: tc-mb <caitianchi@modelbest.cn>
Signed-off-by: Vincent-Xiao <vincent.xiao.me@gmail.com>
Co-authored-by: Vincent-Xiao <vincent.xiao.me@gmail.com>
* [Misc] allow specify is_mm_prefix_lm in hf_config (#34215)
* Stop testing for slow tokenizers as they will not exist soon (#34235)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [V1][BugFix] Fix EAGLE3 encoder cache miss with disable_chunked_mm_input (#34220)
Signed-off-by: KrxGu <krishom70@gmail.com>
* Bump `mamba-ssm` version in CI for Transformers v5 compatibility (#34233)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* add --insecure arg to the vllm bench to skip TLS (#34026)
Signed-off-by: Fan Yang <yan9fan@meta.com>
Co-authored-by: Fan Yang <yan9fan@meta.com>
* Support benchmarking of Geospatial models (#33922)
Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com>
* [ROCm][Quantization] GPT_OSS in amd-quark format model loading and emulations (#29008)
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
* [compile] Enable AOT compile with 2.10 in trunk. (#34155)
Signed-off-by: Zhengxu Chen <zhxchen17@meta.com>
* [Perf][Kernel] Add faster topKperRow decode kernel for DeepSeek-V3.2 sparse attention (#33680)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* [Core][BugFix] Fix PP KV cache sharding memory validation (#33698)
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
* [BUGFIX] Fix accuracy bugs in Qwen3-Next MTP (#34077)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
* [Docs] Speed up build environment set-up (#34240)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Model Runner V2] Use pinned memory for write_contents (#34222)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
* Minor cleanup for Voxtral (#34247)
Signed-off-by: Andy Lo <andy@mistral.ai>
* [UX nit] Fix non-default api_server_count message (#34152)
Signed-off-by: mgoin <mgoin64@gmail.com>
* [Misc] Introduce ec_both role EC (encoder cache) connector (#34182)
Signed-off-by: Qi Wang <qiwa@nvidia.com>
* Convert online APIs to use Renderer (#34084)
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com”>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com”>
* [Bugfix] Fix weights offloading for sleep mode (#32947)
Signed-off-by: Jarno Seppänen <jseppanen@nvidia.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
* [Benchmarks] Fix attention benchmark smoke test (#34269)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
* [Bugfix] Fix mamba cache dtype for Qwen3.5 (#34200)
Signed-off-by: Roger Wang <hey@rogerw.io>
* [SM100] Resubmit FMHA FP8 prefill for MLA (#31195)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
* [Feature] Warn about unrecognized environment variables (#33581)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
* [Perf] Move eplb rebalance algo to async thread (#30888)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
* [BugFix] Fix async EPLB hang with DeepEP LL all2all backend (#32860)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
* [Misc][Spec Decode] support different load config for draft model (#34022)
Signed-off-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com>
Co-authored-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com>
* [torch.compile] Disable recursive pre_grad_passes (#34092)
Signed-off-by: Richard Zou <zou3519@gmail.com>
* [Misc] Add pre-commit hook to catch boolean ops in with-statements (#34271)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* [CI] Add pip caching to cleanup_pr_body workflow (#32979)
Signed-off-by: 7. Sun <jhao.sun@gmail.com>
* [MoE Refactor] Introduce MoERunner abstraction and move execution logic from FusedMoE to DefaultMoERunner (#32344)
Signed-off-by: Bill Nell <bnell@redhat.com>
* [ROCm][CI] Fix test_sequence_parallel.py location in AMD CI pipeline (#34280)
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
* [Misc] Add run one batch script that supports profiling (#32968)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
* [Bugfix] Fix Worker.load_model context-manager composition for sleep mode (#34021)
Signed-off-by: tianshu.yu <tianshuyu.formal@gmail.com>
* [Redo] Add `--trust-remote-code` to dataset bench args (#34251)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [torch.compile] Stop doing unnecessary FakeTensorProp in PiecewiseCompileInterpreter (#34093)
Signed-off-by: Richard Zou <zou3519@gmail.com>
* [WideEP] Fix nvfp4 DeepEP High Throughput All2All backend (#33738)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
* [Misc] Clean up validation logic in input processor (#34144)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Bugfix][DeepSeek-V3.2] fix fp8 kvcache type cast (#33884)
Signed-off-by: Kebe <mail@kebe7jun.com>
* [Kernel] Apply 256bit LDG/STG To Activation Kernels (#33022)
Signed-off-by: Dzerzhinsky <256908701+AstroVoyager7@users.noreply.github.com>
Signed-off-by: Дзержи́нский <256908701+AstroVoyager7@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
* [XPU][7/N] enable xpu fp8 moe (#34202)
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
* [Plugin] Simplify IO Processor Plugin interface (#34236)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Bugfix] Fix benchmark_moe.py inplace assertion with torch >= 2.9 (#34149)
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
* Threshold fix wvSplitk for occasional CI fails (#34013)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
* [ModelBash][DSR1 NVFp4] Removed Bf16 Bias Cast (#34298)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
* [Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides (#34279)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* [Bugfix] Fix weight naming in Qwen3.5 (#34313)
Signed-off-by: Roger Wang <hey@rogerw.io>
* [CPU] Enable FP16 (Half dtype) support for s390x (#34116)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
* [model] support FunASR model (#33247)
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
* [XPU][9/N] clean up existing ipex code/doc (#34111)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
* [Chore] Move `BaseRenderer` to `base.py` (#34308)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [torch.compile] Enable AR+rms fusion by default available for `-O2` (#34299)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
* [Misc] Bump `fastsafetensors` version for latest fixes (#34273)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* [Doc] Update Marlin support matrix for Turing (#34319)
Signed-off-by: Tianqi Ren <tianqi.r@outlook.com>
* [Frontend] Exploit tokenizers "new stream" in FastIncrementalDetokenizer (#34217)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* Patch protobuf for CVE-2026-0994 (#34253)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
* [Docs] Reduce time spent generating API docs (#34255)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Bugfix][CPU] Fix llama4 inference on CPU (#34321)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
* Make Qwen3VL compatible with Transformers v5 (#34262)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
* Make JAIS compatible with Transformers v5 (#34264)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [NVIDIA][test] Tests for flashinfer TRTLLM BF16 MoE (#33715)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
* Responses harmony system message structured (#34268)
Signed-off-by: Adam Binford <adamq43@gmail.com>
* Reapply [Attention][FA3] Update FA3 to include new swizzle optimization (#34043)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
* Don't try and run GLM-ASR with remote code (#34352)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Bugfix]: Fix ROCm fusion attn test; use AttentionBackend utils to create kv cache (#33948)
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
* [ROCm] [aiter] Split KV cache update for AiterFlashAttention (#33681)
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
* [Docs] Fix typo ("defult") and double spacing (#34348)
Signed-off-by: SorenDreano <71752785+SorenDreano@users.noreply.github.com>
Co-authored-by: Soren Dreano <soren@numind.ai>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [CI][BugFix] Fix silent failure in shellcheck hook and baseline exist… (#32458)
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
* [Model Runner V2] Init cuda graph pool when necessary (#33217)
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
* [Multimodal] Expose `mm_processor_kwargs` for `DummyInputsBuilder` (#34330)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
* [Bugfix] fix default is_neox_style is True for deepseek (#34353)
Signed-off-by: dongxinyu03 <dongxinyu03@baidu.com>
* [Bugfix] Enable attn quantization of Llama-4 by correctly permuting scales for rope (int8, fp8) (#34243)
Signed-off-by: Your Name <you@example.com>
Co-authored-by: Your Name <you@example.com>
* [ROCm] [CI] fix test_unrecognized_env (#34350)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
* [GPT-OSS] Remove unnecessary contiguous (#34337)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
* Add cartridge (prefix) benchmark configs to CI workflows
The prefix_latency and prefix_throughput configs existed but weren't
being run by any workflow. Each benchmark workflow now runs both the
base and cartridge configs using the shared server support.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* update flashinfer
* update wheel
* update cuda and flashinfer
* downgrade
* update tests
* experimental: implement pipelining
* add pipeline test
* configure PR to actually run
* bugfix
* loosen TPOT threshold for cartridge latency
* improve pipelining
* simplify pipelining impl
---------
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: rinbaro <ilgomishra@gmail.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Signed-off-by: Andrew Xia <axia@fb.com>
Signed-off-by: Andrew Xia <axia@meta.com>
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
Signed-off-by: Liran Schour <lirans@il.ibm.com>
Signed-off-by: liranschour <liranschour@users.noreply.github.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: mariohong <mariohong128@gmail.com>
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: Aaron Hao <ahao@anyscale.com>
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Lihao Ran <imlihao.ran@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: rabi <ramishra@redhat.com>
Signed-off-by: limingliang <limingliang@stepfun.com>
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: sihao.li <sihao.li@intel.com>
Signed-off-by: Chengcheng Pei <chengchengpei@outlook.com>
Signed-off-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com>
Signed-off-by: Gassan <gassan.salama@arm.com>
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
Signed-off-by: zhangyue66 <zhangyue66@baidu.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: ProExpertProg <luka.govedic@gmail.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: kurt <kurt@thinkingmachines.ai>
Signed-off-by: raushan <raushan@huggingface.co>
Signed-off-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz>
Signed-off-by: Frederic Odermatt <frederic.odermatt@44ai.ch>
Signed-off-by: caitianchi <caitianchi@modelbest.cn>
Signed-off-by: tc-mb <caitianchi@modelbest.cn>
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
Signed-off-by: Eldar Kurtić <8884008+eldarkurtic@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Ikenna <ikennachifo@gmail.com>
Signed-off-by: Xin Yang <xyangx@amazon.com>
Signed-off-by: code4me2 <velvetmoon222999@gmail.com>
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Signed-off-by: Richard Zou <zou3519@gmail.com>
Signed-off-by: Akintunde Oladipo <akintunde.oladipo@servicenow.com>
Signed-off-by: TundeAtSN <akintunde.oladipo@servicenow.com>
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
Signed-off-by: Zifei Tong <zifeitong@gmail.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: Jiang Wu <jwu@cclgroup.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com”>
Signed-off-by: Reagan Lee <reaganjlee@gmail.com>
Signed-off-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com>
Signed-off-by: aabbccddwasd <aabbccddwasd@qq.com>
Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
Signed-off-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Signed-off-by: ihb2032 <hebome@foxmail.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: nikhil-arm <nikhil.gupta2@arm.com>
Signed-off-by: JJJYmmm <1650675829@qq.com>
Signed-off-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Hongming Zheng <hongming.zheng@intel.com>
Signed-off-by: ZhengHongming888 <hongming.zheng@intel.com>
Signed-off-by: Tomer Natan <tbarnatan@nvidia.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: Artus KG <artuskg@gmail.com>
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Andy Xie <andy.xning@gmail.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Signed-off-by: Balaxxe <136368465+jaim12005@users.noreply.github.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Zetong Li <slippersss@126.com>
Signed-off-by: Jaebok Lee <jaebok9541@naver.com>
Signed-off-by: Vincent-Xiao <vincent.xiao.me@gmail.com>
Signed-off-by: KrxGu <krishom70@gmail.com>
Signed-off-by: Fan Yang <yan9fan@meta.com>
Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Zhengxu Chen <zhxchen17@meta.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Signed-off-by: Andy Lo <andy@mistral.ai>
Signed-off-by: Qi Wang <qiwa@nvidia.com>
Signed-off-by: Jarno Seppänen <jseppanen@nvidia.com>
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com>
Signed-off-by: 7. Sun <jhao.sun@gmail.com>
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Signed-off-by: tianshu.yu <tianshuyu.formal@gmail.com>
Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: Dzerzhinsky <256908701+AstroVoyager7@users.noreply.github.com>
Signed-off-by: Дзержи́нский <256908701+AstroVoyager7@users.noreply.github.com>
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: Tianqi Ren <tianqi.r@outlook.com>
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
Signed-off-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Signed-off-by: SorenDreano <71752785+SorenDreano@users.noreply.github.com>
Signed-off-by: dongxinyu03 <dongxinyu03@baidu.com>
Signed-off-by: Your Name <you@example.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
Co-authored-by: Andreas Karatzas <akaratza@amd.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: rinbaro <ilgomishra@gmail.com>
Co-authored-by: rasmith <Randall.Smith@amd.com>
Co-authored-by: Andrew Xia <axia@meta.com>
Co-authored-by: Andrew Xia <axia@fb.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: Fadi Arafeh <115173828+fadara01@users.noreply.github.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: liranschour <liranschour@users.noreply.github.com>
Co-authored-by: Or Ozeri <or@ozery.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Mario Hong <86880754+mariohong128@users.noreply.github.com>
Co-authored-by: Aaron Hao <ahao@anyscale.com>
Co-authored-by: SumanthRH <sumanthrh99@gmail.com>
Co-authored-by: danisereb <daserebrenik@nvidia.com>
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
Co-authored-by: zackyoray <yorayz@nvidia.com>
Co-authored-by: bnellnm <49004751+bnellnm@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Tsukasa OI <floss_llm@irq.a4lg.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Lumosis <30372757+Lumosis@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Hashem Hashemi <159079214+amd-hhashemi@users.noreply.github.com>
Co-authored-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
Co-authored-by: emricksini-h <emrick.birivoutin@hcompany.ai>
Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Rabi Mishra <ramishra@redhat.com>
Co-authored-by: Mingliang Li <limingliang0527@gmail.com>
Co-authored-by: limingliang <limingliang@stepfun.com>
Co-authored-by: R3hankhan <Rehan.Khan7@ibm.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: sihao_li <165983188+1643661061leo@users.noreply.github.com>
Co-authored-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Gassan Salama <gassan.salama@arm.com>
Co-authored-by: Xinyu Chen <xinyu1.chen@intel.com>
Co-authored-by: zhang-prog <69562787+zhang-prog@users.noreply.github.com>
Co-authored-by: Kurt Shuster <shuster.kurt@gmail.com>
Co-authored-by: SorenDreano <71752785+SorenDreano@users.noreply.github.com>
Co-authored-by: Soren Dreano <soren@numind.ai>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: FredericOdermatt <50372080+FredericOdermatt@users.noreply.github.com>
Co-authored-by: tc-mb <157115220+tc-mb@users.noreply.github.com>
Co-authored-by: mslv <mslv@baai.ac.cn>
Co-authored-by: zofia <110436990+zufangzhu@users.noreply.github.com>
Co-authored-by: Eldar Kurtić <8884008+eldarkurtic@users.noreply.github.com>
Co-authored-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: zhrrr <43847754+izhuhaoran@users.noreply.github.com>
Co-authored-by: Charlie Fu <charlifu@amd.com>
Co-authored-by: xuebwang-amd <xuebwang@amd.com>
Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>
Co-authored-by: Dimitrios Bariamis <dbari@users.noreply.github.com>
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Ikenna <ikennachifo@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: 果冻虾仁 <guodong@apache.org>
Co-authored-by: Vel <110626982+Code4me2@users.noreply.github.com>
Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com>
Co-authored-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Co-authored-by: Richard Zou <zou3519@users.noreply.github.com>
Co-authored-by: TundeAtSN <akintunde.oladipo@servicenow.com>
Co-authored-by: Pooya Davoodi <pooya.davoodi@parasail.io>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com>
Co-authored-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com>
Co-authored-by: zifeitong <zifeitong@gmail.com>
Co-authored-by: Jiang Wu <jwu@cclgroup.com>
Co-authored-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com”>
Co-authored-by: aabbccddwasd <140953076+aabbccddwasd@users.noreply.github.com>
Co-authored-by: TomerBN-Nvidia <tbarnatan@nvidia.com>
Co-authored-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
Co-authored-by: navmarri14 <nmarri@roblox.com>
Co-authored-by: Andrey Talman <atalman@fb.com>
Co-authored-by: ihb2032 <40718643+ihb2032@users.noreply.github.com>
Co-authored-by: root <root@LAPTOP-FKNHV411.localdomain>
Co-authored-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Nikhil Gupta <nikhil.gupta2@arm.com>
Co-authored-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com>
Co-authored-by: wulipc <wulipc@users.noreply.github.com>
Co-authored-by: ywang96 <ywang96@users.noreply.github.com>
Co-authored-by: Isotr0py <Isotr0py@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: ZhengHongming888 <hongming.zheng@intel.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: Artus Krohn-Grimberghe <artuskg@users.noreply.github.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: Ning Xie <andy.xning@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Yuwei An <ayw.sirius19@gmail.com>
Co-authored-by: Balaxxe <136368465+jaim12005@users.noreply.github.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Zetong Li <48438720+slippersss@users.noreply.github.com>
Co-authored-by: zzaebok <44357534+zzaebok@users.noreply.github.com>
Co-authored-by: Vincent-Xiao <vincent.xiao.me@gmail.com>
Co-authored-by: Phúc H. Lê Khắc <lkhphuc@pm.me>
Co-authored-by: Krish Gupta <krishom70@gmail.com>
Co-authored-by: Fan Yang <fanyang.real@gmail.com>
Co-authored-by: Fan Yang <yan9fan@meta.com>
Co-authored-by: mgazz <michele.gazzetti1@ibm.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Zhengxu Chen <zhxchen17@meta.com>
Co-authored-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Andy Lo <andy@mistral.ai>
Co-authored-by: Qi Wang <wqstu1@gmail.com>
Co-authored-by: J Seppänen <83203+jseppanen@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Ilya Markov <markovilya197@gmail.com>
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Zhengkai Zhang <33679250+ZhengkaiZ@users.noreply.github.com>
Co-authored-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com>
Co-authored-by: 7. Sun <jhao.sun@gmail.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
Co-authored-by: tianshu-Michael-yu <101950379+tianshu-Michael-yu@users.noreply.github.com>
Co-authored-by: Kebe <mail@kebe7jun.com>
Co-authored-by: Дзержи́нский <256908701+AstroVoyager7@users.noreply.github.com>
Co-authored-by: Matthias Gehre <matthias.gehre@amd.com>
Co-authored-by: AllenDou <allen.dou@hotmail.com>
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: Tianqi Ren <tianqi.r@outlook.com>
Co-authored-by: Linda <57756729+Linda-Stadter@users.noreply.github.com>
Co-authored-by: Adam Binford <adamq43@gmail.com>
Co-authored-by: kliuae <17350011+kliuae@users.noreply.github.com>
Co-authored-by: Xinyu Dong <dongxinyu03@baidu.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Purpose
Add support for ModelOpt MXFP8 dense models.
No support for MoE yet.
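For readers unfamiliar with the format, the following is a minimal pure-Python sketch of MXFP8 block quantization as described in the OCP Microscaling (MX) specification: 32 elements share one power-of-two (E8M0) scale, and each element is stored as FP8 E4M3. It is an illustration only, not the ModelOpt or vLLM implementation. All function names are hypothetical, and NaN/Inf handling is omitted.

```python
import math

BLOCK_SIZE = 32    # elements sharing one scale in the MX formats
E4M3_EMAX = 8      # exponent of the largest E4M3 value: 448 = 1.75 * 2**8
E4M3_EMIN = -6     # exponent of the smallest E4M3 normal value
E4M3_MAX = 448.0


def block_scale_exponent(block):
    """Exponent of the shared power-of-two (E8M0) scale for one block."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return -127  # smallest exponent representable in E8M0
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) == e - 1.
    _, e = math.frexp(amax)
    return max(-127, min(127, (e - 1) - E4M3_EMAX))


def round_to_e4m3(x):
    """Round x to the nearest E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    _, e = math.frexp(abs(x))
    exp = max(e - 1, E4M3_EMIN)   # subnormals share the minimum exponent
    step = 2.0 ** (exp - 3)       # 3 mantissa bits -> 8 steps per binade
    q = min(round(abs(x) / step) * step, E4M3_MAX)
    return math.copysign(q, x)


def quantize_block(block):
    """Return (scale exponent, E4M3 element values) for one 32-element block."""
    assert len(block) == BLOCK_SIZE
    exp = block_scale_exponent(block)
    scale = 2.0 ** exp
    return exp, [round_to_e4m3(v / scale) for v in block]


def dequantize_block(exp, elems):
    """Reconstruct approximate original values from a quantized block."""
    scale = 2.0 ** exp
    return [q * scale for q in elems]
```

Because the scale is a pure power of two, dequantization only shifts exponents. The rounding error comes entirely from the 3-bit E4M3 mantissa and from saturation at 448.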
Related PRs
NVIDIA/Model-Optimizer#736
Test Plan
Use this LLM model (BF16):
https://huggingface.co/nvidia/OpenMath2-Llama3.1-8B
Convert the model to MXFP8 using ModelOpt:
The conversion command generates the checkpoint nvidia/OpenMath2-Llama3.1-8B-MXFP8.
Compare performance (tokens/sec) and accuracy (gsm8k) of the BF16 and MXFP8 models.
Test Result
Performance (tokens/sec):
Measured on B200:
vllm bench throughput --model $MODEL_PATH \
  --tensor-parallel-size 1 \
  --trust-remote-code \
  --async-scheduling \
  --backend vllm \
  --dataset-name random \
  --random-prefix-len 0 \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --max-num-seqs 128 \
  --num-prompts 512
BF16
MXFP8
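To keep the BF16 and MXFP8 runs comparable, the benchmark command can be built once and reused for both checkpoints. The sketch below only assembles the argument list from the flags shown above and does not run anything. The helper name `bench_throughput_cmd` and the `MODELS` mapping are hypothetical, and it assumes each checkpoint is reachable under the listed name or path.

```python
import shlex


def bench_throughput_cmd(model_path: str) -> list[str]:
    """Build the `vllm bench throughput` argv used in this test plan."""
    return [
        "vllm", "bench", "throughput",
        "--model", model_path,
        "--tensor-parallel-size", "1",
        "--trust-remote-code",
        "--async-scheduling",
        "--backend", "vllm",
        "--dataset-name", "random",
        "--random-prefix-len", "0",
        "--random-input-len", "1024",
        "--random-output-len", "1024",
        "--max-num-seqs", "128",
        "--num-prompts", "512",
    ]


# Assumed locations: adjust to wherever the two checkpoints actually live.
MODELS = {
    "BF16": "nvidia/OpenMath2-Llama3.1-8B",
    "MXFP8": "nvidia/OpenMath2-Llama3.1-8B-MXFP8",
}

for label, path in MODELS.items():
    # shlex.join produces a shell-safe, copy-pasteable command line.
    print(f"{label}: {shlex.join(bench_throughput_cmd(path))}")
```

The list can be passed directly to `subprocess.run` if the benchmark should be launched from Python rather than the shell.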
Accuracy (GSM8K):
lm_eval \
  --model vllm \
  --model_args pretrained=$MODEL_PATH,max_model_len=4096,enforce_eager=True,attention_backend=TRITON_ATTN \
  --tasks gsm8k \
  --batch_size auto --limit 300
BF16
MXFP8
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.