Releases: NVIDIA/TensorRT-LLM
v1.2.0rc8
Highlights
Model Support
- Add export patch for GraniteMoe MoE models to enable torch.export compatibility (#10169)
- Eagle: capture hidden states for Qwen2 (#10091)
- Add pp support for DeepSeek-v3.2 (#10449)
- Pass lora_params through Qwen2/3 model forward (#10174)
- Fix export for microsoft/Phi-3-medium-128k-instruct (#10455)
- Minor code refinements for Mistral Large 3 (#10405)
- EPD for Qwen3 VL (#10470)
- Remove some model support; add device constraint (#10563)
- Enable AttentionDP on Qwen3-VL and fix test (#10435)
API
- Add stability tags for serve subcommand (#10012)
Feature
- Better align MLA chunking with indexer chunking when chunked prefill enabled for DSV32 (#10552)
- Add SM100 weight-only kernel (#10190)
- AutoTuner Cache: Support cache file lock and merge all ranks into one (#10336)
- Apply AutoTuner to AllReduce Op for strategy tuning (#8531)
- Add transferAgent binding (step 1) (#10113)
- Add the eos tokens in generation config to stop words in the sampler (#10389)
- Apply fusion for W4AFP8_AWQ MoE (#9838)
- Further reduce tuning time for cuteDSL nvFP4 dense gemm (#10339)
- Run sample_async on extra stream (#10215)
- Optimize qk rope/nope concat for DSA (#10571)
Fix
- Fix bug of Mistral-Small-3.1-24B-Instruct-2503 (#10394)
- Use 0 port as arbitrary port when disagg service discovery is enabled (#10383)
- Fix buffer reuse for CUDA graph attention metadata (#10393)
- Force release torch memory when LLM is destroyed (#10314)
- Swap TP-CP grouping order (#10350)
- TRTLLM MoE maps to lower tuning buckets when ep>1 (#9998)
- Fix draft token tree chain crash and depth=1 corner case (#10386, #10385)
- Fixed recursive node traversals (#10379)
- Fix undefined tokens_per_block (#10438)
- Skip spec dec for non-last rank (#10445)
- Setup dist before using autotuner (#10491)
- Fix broken cast (#9975)
- Fix sm120 speculation (#10049)
- Fix mamba_cache_manager when enabling cuda_graph_padding and let test cover this case (#9873)
- Choose registered model config over root config for VLM (#10553)
Documentation
- Update SWA + spec dec support matrix (#10421)
- Add --config preference over --extra_llm_api_options in CODING_GUIDELINES.md (#10426)
- Adding parallelism types in feature combination matrix (#9849)
- Update GPTOSS Doc (#10536)
- Blog: Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs (#10565)
- Update Qwen3-Next doc by adding known issues section (#10582)
Test & Infra
- Add tests for DeepSeek v3.2 (#10561)
- Add accuracy tests for super-v3 with multiple-gpus (#10234)
- Layer-wise benchmarks: support TEP balance, polish slurm scripts (#10237)
- Add disag-serving kimi k2 thinking tests (#10357)
- Partition test_llm_pytorch.py for parallel execution (#10400)
- Only Use Throughput Metrics to Check Regression (#10404)
- Add vswa test cases coverage (#10146)
- Use random port in container port section (#10432)
- Remove redundant retries while binding to arbitrary port (#10452)
- Add qwen3-4b accuracy test case (#10382)
- Update kimi-k2-1k1k dataset (#10473)
- Fix concurrency list in Wide-EP perf tests (#10529)
- Restrict max_num_tokens in disagg mtp config (#10442)
- Add kimi_k2 single node perf test (#10436)
- Add MMMU test for mistral small (#10530)
- Workaround OCI-NRT slowdown issue (#10587)
What's Changed
- [#8391][chore] added deepseek_r1_distill_qwen_32b AutoDeploy perf test to L0 by @MrGeva in #10377
- [https://nvbugs/5670469][fix] Filter 0s and choose min of kv_head for Nemotron model by @farazkh80 in #10206
- [https://nvbugs/5772363][fix] fix bug of Mistral-Small-3.1-24B-Instruct-2503 by @byshiue in #10394
- [https://nvbugs/5649010][fix] use 0 port as arbitrary port when disagg service discovery is enabled by @reasonsolo in #10383
- [TRTLLM-10065][feat] Add accuracy tests for super-v3 with multiple-gpus by @Wanli-Jiang in #10234
- [https://nvbugs/5779534][fix] fix buffer reuse for CUDA graph attention metadata by @lfr-0531 in #10393
- [None][feat] sm100 weight-only kernel by @Njuapp in #10190
- [https://nvbugs/5701425][chore] Unwaive tests. by @yuxianq in #10269
- [None][feat] Layer-wise benchmarks: support TEP balance, polish slurm scripts by @yuantailing in #10237
- [None][infra] Waive failed cases in post-merge on 1/5 by @EmmaQiaoCh in #10399
- [TRTLLM-10185][feat] AutoTuner Cache: Support cache file lock and merge all ranks into one by @hyukn in #10336
- [TRTLLM-8242][feat] Add stability tags for serve subcommand by @LinPoly in #10012
- [https://nvbugs/5752521][fix] Unwaive test_trtllm_flashinfer_symbol_collision.py by @yihwang-nv in #10227
- [None][infra] Waive failed cases again on 1/5 by @EmmaQiaoCh in #10403
- [https://nvbugs/5715568][fix] Force to release torch memory when LLM is destroyed by @HuiGao-NV in #10314
- [TRTLLM-8821][feat] Apply AutoTuner to AllReduce Op for strategy tuning. by @hyukn in #8531
- [None][feat] update deepgemm to the DeepGEMM/nv_dev branch by @lfr-0531 in #9898
- [TRTLLM-9381][test] add disag-serving kimi k2 thinking tests by @xinhe-nv in #10357
- [#10374][fix] fixed race condition in AutoDeploy's mp tests port acquisition by @MrGeva in #10366
- [TRTLLM-9465][fix] Swap TP-CP grouping order by @brb-nv in #10350
- [None][perf] TRTLLM MoE maps to lower tuning buckets when ep>1 by @rosenrodt in #9998
- [TRTLLM-10053][feat] AutoDeploy: Add Super v3 config file, improve test runtime by @galagam in #10397
- [https://nvbugs/5772521][fix] Fix draft token tree chain crash by @mikeiovine in #10386
- [https://nvbugs/5772414][fix] Fix draft token tree depth=1 corner case by @mikeiovine in #10385
- [TRTLLM-9767][feat] Fixed recursive node traversals by @greg-kwasniewski1 in #10379
- [TRTLLM-9551][infra] Partition test_llm_pytorch.py for parallel execution by @Superjomn in #10400
- [https://nvbugs/5695984][fix] Unwaive llama3 eagle test by @mikeiovine in #10092
- [https://nvbugs/5745152][fix] Unwaive gpt oss spec decode test by @mikeiovine in #10370
- [#10170][fix] Add export patch for GraniteMoe MoE models to enable torch.export compatibility by @karthikvetrivel in #10169
- [https://nvbugs/5777044][chore] Remove solved bugs from waives.txt by @SimengLiu-nv in #10422
- [None][feat] precompiled installation from local src dir by @lucaslie in #10419
- [TRTLLM-9527][feat] Add transferAgent binding (step 1) by @chuangz0 in #10113
- [None][fix] Only Use Throughput Metrics to Check Regression by @chenfeiz0326 in #10404
- [None][feat] add the eos tokens in generation config to stop words in the sampler by @JadoTu in #10389
- [None][chore] Update SWA + spec dec support matrix by @mikeiovine in #10421
- [None][feat] CuteDSL MOE FC1 Enhancement by @liyuhannnnn in #10088
- [https://nvbugs/5726962][feat] Apply fusion for W4AFP8_AWQ MoE by @yumin066 in #9838
- [#2511][fix] eagle: qwen2 capture hidden states by @XiaoXuan42 in #10091
- [None][docs] Add `--config` preference over `--extra_llm_api_options` in CODING_GUIDELINES.md by @venkywonka in #10426
- [#8460][feat] Revive and simplify Model Explorer visualization integration by @karthikvetrivel in #10150
- [None][chore] unwaive qwen3 30b test by @kris1025 in #10115
- [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10384
- [None][test] update test case constraint by @crazydemo in #10381
- [https://nvbugs/5769926] [fix] Add no container mount home WAR by @kaiyux in #10431
- [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10367
- [TRTLLM-9622][infra] Enable DGX_B300 multi-gpu testing in pre-merge pipeline by @yiqingy0 in #9699
- [TRTLLM-9896][test] add vswa test cases coverage by @crazydemo in #10146
- [None] [fix] Fix undefined tokens_per_block by @kaiyux in #10438
- [https://nvbugs/5772361][ci] Unwaive tests that have been fixed by @2ez4bz...
v1.2.0rc6.post1
Security Vulnerabilities
GnuPG Vulnerability
A security vulnerability has been identified in GnuPG versions prior to 2.4.9, which is present in the Ubuntu 24.04 LTS utilized by the TensorRT LLM base image. For details regarding this vulnerability, please refer to the official Ubuntu advisory: CVE-2025-68973. An official patched package for the Ubuntu system is currently pending. The fix will be included in the next release once the updated package is published and incorporated. To mitigate potential risks immediately, users are advised to manually upgrade GnuPG to version 2.4.9 or later.
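As a quick check inside a container built on the TensorRT LLM base image, the snippet below prints the installed GnuPG version and attempts a package upgrade. This is a minimal sketch: `gnupg` is the standard Ubuntu package name, and the `apt-get` upgrade will only pick up the fix once Ubuntu publishes the patched package; until then, GnuPG 2.4.9 or later must be installed manually (for example, built from source).

```bash
# Print the installed GnuPG version; anything below 2.4.9 is affected
gpg --version | head -n 1

# Standard package upgrade; effective only once Ubuntu ships the patched gnupg
apt-get update && apt-get install --only-upgrade -y gnupg
```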
Hugging Face Transformers Vulnerabilities
Several security vulnerabilities have been disclosed regarding the Hugging Face Transformers library used in TensorRT LLM. As these issues originate from an upstream dependency, remediation is dependent on the release of a patch by the Hugging Face team. We are actively monitoring the situation and will update TensorRT LLM to include the necessary fixes once a stable release of the Transformers library addressing these vulnerabilities becomes available. Affected CVEs: CVE-2025-14920, CVE-2025-14921, CVE-2025-14924, CVE-2025-14927, CVE-2025-14928, CVE-2025-14929, CVE-2025-14930
What's Changed
- [https://nvbugs/5708810][fix] Fix TRTLLMSampler by @moraxu in #9710
- [TRTLLM-9641][infra] Use public triton 3.5.0 in SBSA by @ZhanruiSunCh in #9652
- [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #9979
- [TRTLLM-9794][ci] move more test cases to gb200 by @QiJune in #9994
- [None][feat] Add routing support for the new model for both cutlass and trtllm moe backend by @ChristinaZ in #9792
- [TRTLLM-8310][feat] Add Qwen3-VL-MoE by @yechank-nvidia in #9689
- [https://nvbugs/5731717][fix] fixed flashinfer build race condition during test by @MrGeva in #9983
- [FMDL-1222][feat] Support weight and weight_scale padding for NVFP4 MoE cutlass by @Wanli-Jiang in #9358
- [None][chore] Update internal_cutlass_kernels artifacts by @yihwang-nv in #9992
- [None][docs] Add README for Nemotron Nano v3 by @2ez4bz in #10017
- [None][infra] Fixing credential loading in lockfile generation pipeline by @yuanjingx87 in #10020
- [https://nvbugs/5727952][fix] a pdl bug in trtllm-gen fmha kernels by @PerkzZheng in #9913
- [None][infra] Waive failed test for main branch on 12/16 by @EmmaQiaoCh in #10029
- [None][doc] Update CONTRIBUTING.md by @syuoni in #10023
- [None][fix] Fix Illegal Memory Access for CuteDSL Grouped GEMM by @syuoni in #10008
- [TRTLLM-9181][feat] improve disagg-server prometheus metrics; synchronize workers' clocks when workers are dynamic by @reasonsolo in #9726
- [None][chore] Final mass integration of release/1.1 by @mikeiovine in #9960
- [None][fix] Fix iteration stats for spec-dec by @achartier in #9855
- [https://nvbugs/5741060][fix] Fix pg op test by @shuyixiong in #9989
- [https://nvbugs/5635153][chore] Remove responses tests from waive list by @JunyiXu-nv in #10026
- [None] [feat] Enhancements to slurm scripts by @kaiyux in #10031
- [None][infra] Waive failed tests due to llm model files by @EmmaQiaoCh in #10068
- [None][fix] Enabled simultaneous support for low-precision combine and MTP. by @yilin-void in #9091
- [https://nvbugs/5698434][test] Add Qwen3-4B-Eagle3 One-model perf test by @yufeiwu-nv in #10041
- [TRTLLM-9998][fix] Change trtllm-gen MoE distributed tuning strategy back to INDEPENDENT by @hyukn in #10036
- [TRTLLM-9989][fix] Disable tvm_ffi for CuteDSL nvFP4 dense GEMM. by @hyukn in #10040
- [None][chore] Remove unnecessary warning log for tuning. by @hyukn in #10077
- [TRTLLM-9680][perf] Optimize TRTLLMSampler log_probs performance (Core fix has been merged via #9353) by @tongyuantongyu in #9655
- [None][chore] Bump version to 1.2.0rc6.post1 by @yiqingy0 in #10484
Full Changelog: v1.2.0rc6...v1.2.0rc6.post1
v1.2.0rc2.post1
Security Vulnerabilities
GnuPG Vulnerability
A security vulnerability has been identified in GnuPG versions prior to 2.4.9, which is present in the Ubuntu 24.04 LTS utilized by the TensorRT LLM base image. For details regarding this vulnerability, please refer to the official Ubuntu advisory: CVE-2025-68973. An official patched package for the Ubuntu system is currently pending. The fix will be included in the next release once the updated package is published and incorporated. To mitigate potential risks immediately, users are advised to manually upgrade GnuPG to version 2.4.9 or later.
Hugging Face Transformers Vulnerabilities
Several security vulnerabilities have been disclosed regarding the Hugging Face Transformers library used in TensorRT LLM. As these issues originate from an upstream dependency, remediation is dependent on the release of a patch by the Hugging Face team. We are actively monitoring the situation and will update TensorRT LLM to include the necessary fixes once a stable release of the Transformers library addressing these vulnerabilities becomes available. Affected CVEs: CVE-2025-14920, CVE-2025-14921, CVE-2025-14924, CVE-2025-14927, CVE-2025-14928, CVE-2025-14929, CVE-2025-14930
What's Changed
- [None][chore] Bump version to 1.2.0rc2.post1 by @yiqingy0 in #10286
- [TRTLLM-9752][fix] disable PDL for quant kernels by @bo-nv in #10288
Full Changelog: v1.2.0rc2...v1.2.0rc2.post1
v1.2.0rc7
Security Vulnerabilities
GnuPG Vulnerability
A security vulnerability has been identified in GnuPG versions prior to 2.4.9, which is present in the Ubuntu 24.04 LTS utilized by the TensorRT LLM base image. For details regarding this vulnerability, please refer to the official Ubuntu advisory: CVE-2025-68973. An official patched package for the Ubuntu system is currently pending. The fix will be included in the next release once the updated package is published and incorporated. To mitigate potential risks immediately, users are advised to manually upgrade GnuPG to version 2.4.9 or later.
Hugging Face Transformers Vulnerabilities
Several security vulnerabilities have been disclosed regarding the Hugging Face Transformers library used in TensorRT LLM. As these issues originate from an upstream dependency, remediation is dependent on the release of a patch by the Hugging Face team. We are actively monitoring the situation and will update TensorRT LLM to include the necessary fixes once a stable release of the Transformers library addressing these vulnerabilities becomes available. Affected CVEs: CVE-2025-14920, CVE-2025-14921, CVE-2025-14924, CVE-2025-14927, CVE-2025-14928, CVE-2025-14929, CVE-2025-14930
Highlights
Model Support
- Add Qwen3-VL-MoE (#9689)
- Support DeepSeek-V32 chat template (#9814)
- Support DeepSeek-V3.2, R1 and V3.1 tool parser (#10126, #10010)
- Support Eagle3 on Mistral Large3 (#9971)
- Support VLM part for Mistral Large 3 (#10188)
- Support multi-GPU execution for nemotron-v3-nano and super (#10118)
- Support Qwen3-VL dense model in pytorch backend (#9060)
- Support NVFP4 for gptoss (#8956)
- Add MLA Based Eagle (#9677)
API
Feature
- Support NVFP4 weight and weight_scale padding for MoE cutlass (#9358)
- Add routing support for the new model for cutlass and TRTLLM MoE backend (#9792)
- Improve disagg-server prometheus metrics and synchronize dynamic workers’ clocks (#9726)
- Update TRT-LLM Gen MoE for NvFp4 + bias with tileN=256 (#9734)
- Add optimization options for MOE CuteDSL finalized kernel (#10042)
- Add fp8 bmm on sm120 (#9687)
- Reuse alltoall workspace for CuteDSL MoE output (#9840)
- Support Mooncake transfer engine as cache transceiver backend (#8309)
- Enable KV cache reuse for config database (#10094)
- Enable PDL for CuteDSL kernels and overlap MoeOutputMemset (#10043)
- Cudagraph updates for helix parallelism (#10141)
- Custom AllToAll for helix parallelism (#9986)
- Pass MRoPE tensors for EPD disagg (#9758)
- Reuse previous draft requests if possible (#10263)
- Enable PDL by default (#9695)
- Enable 2CTA with autotune for CuteDSL MoE and Grouped GEMM optimizations (#10201)
- Provide attention NVFP4 out support for torch compile (#9740)
- Increase topk upper limit to 22 for NVLinkOneSided AlltoAll (#10229)
- Deliver optimizations for two-model speculative decoding (#10208)
Fix
- Fix PDL bug in trtllm-gen FMHA kernels (#9913)
- Fix Illegal Memory Access for CuteDSL Grouped GEMM (#10008)
- Disable tvm_ffi for CuteDSL nvFP4 dense GEMM (#10040)
- Fix ready signal in NIXL backend (#10000)
- Fix top_k=10 in NVLinkOneSided AlltoAll (#10197)
- Fix race conditions in KV cache communication during unexpected termination (#10076)
- Fix deepseek sharding (#9984)
- Fix contiguous view usage in load_expert weights (#10136)
- Fix detokenizer issue for DeepSeek-v3.2 (#10106)
- Fix index offset overflow in custom Top-K kernel and UT (#10027)
- Fix draft_lengths for CUDA graph capture (#10004)
- Fix port conflict handling for CI (#10392, #10175, #10035)
- Fix NVFP4 linear method weight and weight_scale padding (#10148)
- Fix VSWA block store/load scheme in KV cache manager (#10183)
- Fix ready signal and execution_stream synchronization across components (#10060)
- Fix PP+CP combination with helix parallelism (#10312)
- Fix Gemma3 RoPE for local attention (#9961)
- Make NCCL resource manager destructor exception-safe (#10166)
- Fix detokenizer / tokenizer issues (use local tokenizer, cache vocab) (#10230, #10219)
- Disable PDL for quant kernels to address accuracy (#10285)
- Fix hilo: Avoid property with setter in nn modules (#10212)
Documentation
- Add README for Nemotron Nano v3 (#10017)
- Update CONTRIBUTING.md (#10023)
- Update online benchmarking docs (#9611)
- Update Dynamo Example document (#9619, #10368)
- Update Perf_Overview.md with benchmarking results (#9723)
- Add NIXL-Libfabric usage documentation (#10205)
- Add Sparse Attention feature doc (#9648)
- Update IFB performance guide & GPTOSS deployment guide (#10283)
- Promote perfect MoE router feature documentation (#10303)
Test & Infra
- Fix credential loading in lockfile generation pipeline (#10020)
- Add Qwen3-4B-Eagle3 one-model perf test (#10041)
- Add regression testing for config database (#9832)
- Update tests for nemotron_h (#9993)
- Use ucx as default backend (#10101)
- Fix OpenSearch URL in slurm_launch.sh for multinode perf sanity (#9990)
- Remove helix test from RTX test list (#10224)
- Add ray test robustness and RL perf reproduce script (#9939)
- Support multi-node disagg perf test in CI (#9138)
- Enable single-gpu CI on spark (#9304)
- Add disaggregated stress test (#9354)
- Include LongBenchV1 in trtllm-eval (eval infra aspect) (#10265)
- Fix port conflict avoidance in CI via get_free_port_in_ci (#10392)
Full Changelog: v1.2.0rc6...v1.2.0rc7
What's Changed
- [https://nvbugs/5708810][fix] Fix TRTLLMSampler by @moraxu in #9710
- [TRTLLM-9641][infra] Use public triton 3.5.0 in SBSA by @ZhanruiSunCh in #9652
- [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #9979
- [TRTLLM-9794][ci] move more test cases to gb200 by @QiJune in #9994
- [None][feat] Add routing support for the new model for both cutlass and trtllm moe backend by @ChristinaZ in #9792
- [TRTLLM-8310][feat] Add Qwen3-VL-MoE by @yechank-nvidia in #9689
- [https://nvbugs/5731717][fix] fixed flashinfer build race condition during test by @MrGeva in #9983
- [FMDL-1222][feat] Support weight and weight_scale padding for NVFP4 MoE cutlass by @Wanli-Jiang in #9358
- [None][chore] Update internal_cutlass_kernels artifacts by @yihwang-nv in #9992
- [None][docs] Add README for Nemotron Nano v3 by @2ez4bz in #10017
- [None][infra] Fixing credential loading in lockfile generation pipeline by @yuanjingx87 in #10020
- [https://nvbugs/5727952][fix] a pdl bug in trtllm-gen fmha kernels by @PerkzZheng in #9913
- [None][infra] Waive failed test for main branch on 12/16 by @EmmaQiaoCh in #10029
- [None][doc] Update CONTRIBUTING.md by @syuoni in #10023
- [None][fix] Fix Illegal Memory Access for CuteDSL Grouped GEMM by @syuoni in #10008
- [TRTLLM-9181][feat] improve disagg-server prometheus metrics; synchronize workers' clocks when workers are dynamic by @reasonsolo in #9726
- [None][chore] Final mass integration of release/1.1 by @mikeiovine in #9960
- [None][fix] Fix iteration stats for spec-dec by @achartier in #9855
- [https://nvbugs/5741060][fix] Fix pg op test by @shuyixiong in #9989
- [https://nvbugs/5635153][chore] Remove responses tests from waive list by @JunyiXu-nv in #10026
- [None] [feat] Enhancements to slurm scripts by @kaiyux in #10031
- [None][infra] Waive failed tests due to llm model files by @EmmaQiaoCh in #10068
- [None][fix] Enabled simultaneous support for low-precision combine and MTP. by @yilin-void in #9091
- [https://nvbugs/5698434][test] Add Qwen3-4B-Eagle3 One-model perf test by @yufeiwu-nv in #10041
- [TRTLLM-9998][fix] Change trtllm-gen MoE distributed tuning strategy back to INDEPENDENT by @hyukn in #10036
- [TRTLLM-9989][fix] Disable tvm_ffi for CuteDSL nvFP4 dense GEMM. by @hyukn in #10040
- [None][chore] Remove unnecessary warning log for tuning. by @hyukn in #10077
- [TRTLLM-9680][perf] Optimize TRTLLMSampler log_probs performa...
v1.2.0rc6
Highlights
Model Support
API
Feature
- 2D parallel EP TP support (#9459)
- Fused kernels (qknormrope + moe routing) and two-model MTP support for glm4moe (#9852)
- Add gather fc1 kernel by cuteDSL (#9618)
- Add GB300 support since it does not support segment (#9731)
- Add helixPostProcessNative kernel for cp_dim=2 (#9924)
- Add symmetric memory AllReduce strategy (#8919)
- ConfigurableMoE support (#9772, #9858)
- Enable multistream for Linear Attention in Qwen3 (#9696)
- Enable PDL for indexer topK (#9843)
- Implement distributed tuning system (#9621)
- Implement sampling on 1-model EAGLE3 (#9885)
- Move D->H copies to a worker thread (#8463)
- Optimize the host overhead of _sample_async (#9935)
- Port fp4 quantization kernel optimization from FlashInfer (#9854)
- Support larger topK for NVLinkOneSided AlltoAll (#9816)
Fix
- Fix CUDA stream sync issue in ModelRunnerCPP (#6426)
- Fix accuracy issue in TRTLLM MoE (#9999)
- Fix PDL in TRTLLM MOE for dsv3 (#9799)
- Fix unterminated process issue for RemoteOpenAIServer (#9490)
- Fix PDL bugs with trtllm-gen fmha kernels (#9863)
- Use first PP rank's schedule result in other PP ranks to fix PP hang (#9659)
Documentation
Test & Infra
What's Changed
- [https://nvbugs/5703953][fix] Preserving ip:port for trtllm-serve before initializing llm by @JunyiXu-nv in #9646
- [None][infra] Waive failed cases for main branch on 12/07 by @EmmaQiaoCh in #9769
- [None][fix] Several minor fixes to CI setting by @chzblych in #9765
- [OMNIML-3036][doc] Re-branding TensorRT-Model-Optimizer as Nvidia Model-Optimizer by @cjluo-nv in #9679
- [None][feat] Enable NCCL_SYMMETRIC as default fallback for AllReduce by @nv-lschneider in #9314
- [TRTLLM-9000][feat] Add multi-node Perf Tests into CI by @chenfeiz0326 in #8800
- [None][test] add ntp tolerance in time metrics verification by @zhengd-nv in #9741
- [TRTLLM-9603][feat] Enable ConfigurableMoE test in the CI by @xxi-nv in #9645
- [https://nvbugs/5422621][test] Add GB 200 WIDEEP test case for RCCA 5422621 by @fredricz-20070104 in #9506
- [None][fix] Fix two tuning cache miss issues. by @hyukn in #9743
- [TRTLLM-9706] [doc] Update wide EP documents by @kaiyux in #9724
- [https://nvbugs/5666804][test] only adding sampler config for limited models by @ruodil in #9512
- [None][infra] Waive failed cases for main on 12/08 by @EmmaQiaoCh in #9773
- [None][chore] Move the rocketkv e2e test to post-merge by @lfr-0531 in #9768
- [None][chore] Enable tvm_ffi for cute dsl nvfp4_gemm to reduce host overhead. by @limin2021 in #9690
- [TRTLLM-9431][perf] Enable multistream for Linear Attention in Qwen3-… by @nv-guomingz in #9696
- [None][chore] Remove closed bugs by @xinhe-nv in #9770
- [None][infra] update mooncake in docker images by @zhengd-nv in #9584
- [None][test] Add Kimi k2 WIDEEP perf and accuracy cases by @fredricz-20070104 in #9686
- [https://nvbugs/5527655][test] Add test case for RCCA 5527655 by @fredricz-20070104 in #9511
- [http://nvbugs/5649010][fix] fix test_auto_scaling.py::test_worker_restart timeout by @reasonsolo in #9775
- [None][fix] Switch AutoDeploy's default allreduce strategy to NCCL by @MrGeva in #9666
- [TRTLLM-9506][fix] Fix AR for DeepSeek-R1 2 model path by @sunnyqgg in #9661
- [TRTLLM-9089][chore] Port prepare_dataset into trtllm-bench by @FrankD412 in #9250
- [https://nvbugs/5567586][feat] Ampere xqa swa specdec for GPT-OSS Eagle3-one-model by @jhaotingc in #8383
- [TRTLLM-7967][chore] Add more tests by @yibinl-nvidia in #9415
- [https://nvbugs/5508267][fix] Proper handling of inactive canceled requests by @thorjohnsen in #9280
- [#8921][feat] Added symetric memory AllReduce strategy by @MrGeva in #8919
- [None][fix] Fix #8383 introduced TRTLLM backend python error by @jhaotingc in #9804
- [#9753][feat] AutoDeploy: Implement add rms_norm fusion by @nvchenghaoz in #9754
- [None][infra] Correct the waived test names due to a merge conflict by @yuanjingx87 in #9803
- [None][fix] Fix PDL in TRTLLM MOE for dsv3 by @dmtri35 in #9799
- [None][feat] Add llama4 scaling by @byshiue in #9771
- [https://nvbugs/5677746][fix] Use first PP rank's schedule result in other PP ranks to fix PP hang by @jiaganc in #9659
- [None][fix] Fix unterminated process issue for RemoteOpenAIServer by @JunyiXu-nv in #9490
- [https://nvbugs/5726066][infra] Waive timeout disaggregated/test_auto_scaling tests. by @bobboli in #9815
- [None][chore] Fix tests failing on pre-merge 12/08 by @brb-nv in #9819
- [https://nvbugs/5722653][fix] Fix config file used by disagg_client by @JunyiXu-nv in #9783
- [TRTLLM-6537][chore] Shorten the time limit for dis-agg accuracy testing by @Shixiaowei02 in #9614
- [None][infra] Use artifactory pypi mirror for Cython install by @ZhanruiSunCh in #9774
- [TRTLLM-9794][ci] remove duplicated test cases in DGX B200 by @QiJune in #9817
- [None][test] Refactor qa/llm_perf_nim.yml test list by @yufeiwu-nv in #9700
- [None][chore] Generate lock file for release/1.2.0rc4.post1 branch automatically by @yiqingy0 in #9829
- [None][fix] Additional model outputs for pipeline parallelism by @Funatiq in #9794
- [TRTLLM-6756][feat] Update BeamSearch for TorchSampler by @stnie in #9660
- [TRTLLM-9794][ci] move qwen3-next test cases to gb200 by @QiJune in #9827
- [None][infra] Waive failed cases for main branch on 12/09 by @EmmaQiaoCh in #9839
- [https://nvbugs/5575841] [fix] Nvbug 5575841: Remove additional test waivers for TestMoEFP4 by @DomBrown in #9788
- [None][feat] Make 2-model spec dec use the 1-model kernels (Hopper) by @mikeiovine in #8810
- [None][chore] Adding flaky auto scaling test to waives by @pcastonguay in #9851
- [#8921][chore] AutoDeploy NanoV3 to use SYMM_MEM allreduce strategy by @MrGeva in #9797
- [TRTINFRA-7328][infra] Consume SlurmCluster scratchPath and cleanup mounts by @mlefeb01 in #9600
- [https://nvbugs/5688388][chore] Unwaiving fixed disagg test by @pcastonguay in #9800
- [https://nvbugs/5719561][chore] Unwaive tests for nvbug 5719561 by @pcastonguay in #9801
- [https://nvbugs/5508301][feat] Move D->H copies to a worker thread whe… by @dhansen-nvidia in #8463
- [None][chore] Add unittest for otlp tracing by @zhanghaotong in #8716
- [None][chore] Support larger topK for NVLinkOneSided AlltoAll. by @bobboli in #9816
- [TRTLLM-9794][ci] move some deepseek test cases to gb200 by @QiJune in #9841
- [TRTLLM-9661][fix] Fix nvfp4 gemm allowed backends arg passing by @hyukn in #9837
- [https://nvbugs/5702791][fix] Unwaive fixed test by @dominicshanshan in #9844
- [TRTLLM-...
v1.1.0
Known Issues
- If a project declares `tensorrt-llm==1.1.0` as a dependency in its `pyproject.toml` file, as below:
  `dependencies = [ "tensorrt-llm==1.1.0", ]`
  then installing the project dependencies with `uv sync` fails with the message:
  "No solution found when resolving dependencies for split (markers: python_full_version >= '3.13' and sys_platform == 'darwin'): ╰─▶ Because patchelf==0.18.0.0 was yanked (reason: https://github.com/mayeut/patchelf-pypi/issues/87) and tensorrt-llm==1.1.0 depends on patchelf==0.18.0, we can conclude that tensorrt-llm==1.1.0 cannot be used. And because your project depends on tensorrt-llm==1.1.0, we can conclude that your project's requirements are unsatisfiable."
  This happens because `patchelf` 0.18.0 was yanked by its author.
- A valid workaround is to add the following block to `pyproject.toml`:
  `[tool.uv] override-dependencies = [ "patchelf==0.17.2.4", ]`
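Put together, a minimal `pyproject.toml` applying this workaround could look like the sketch below; the project name, version, and Python requirement are illustrative placeholders, while the dependency pin and the `[tool.uv]` override come directly from the notes above.

```toml
[project]
name = "my-app"            # placeholder project name
version = "0.1.0"          # placeholder version
requires-python = ">=3.10" # adjust to your project
dependencies = [
    "tensorrt-llm==1.1.0",
]

# Override the yanked patchelf==0.18.0 pin pulled in by tensorrt-llm==1.1.0
# so that `uv sync` can resolve the dependency tree.
[tool.uv]
override-dependencies = [
    "patchelf==0.17.2.4",
]
```

With the override in place, `uv sync` resolves and installs the project dependencies as expected.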
What's Changed
- [None][chore] Bump version to 1.1.0rc0 by @yiqingy0 in #6651
- [TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter by @amitz-nv in #6510
- [None][test] correct test-db context for perf yaml file by @ruodil in #6686
- [None] [feat] Add model gpt-oss by @hlu1 in #6645
- [https://nvbugs/5409414][fix] fix Not registered specs by @xinhe-nv in #6660
- [None][feat] : Add FP8 context MLA support for SM120 by @peaceh-nv in #6059
- [TRTLLM-6092][doc] Add LoRA feature usage doc by @shaharmor98 in #6603
- [TRTLLM-6409][feat] Enable guided decoding with speculative decoding (part 1: two-model engine) by @syuoni in #6300
- [TRTLLM-6881][feat] Include attention dp rank info with KV cache events by @pcastonguay in #6563
- [None][infra] Fix guardwords by @EmmaQiaoCh in #6711
- [None][package] Pin cuda-python version to >=12,<13 by @yiqingy0 in #6702
- [None][doc] Add deployment guide section to the official doc website by @nv-guomingz in #6669
- [None][fix] disagg ctx pp4 + gen pp4 integ test by @raayandhar in #6489
- [None][feat] Clean up ngram auto mode, add max_concurrency to configs by @mikeiovine in #6676
- [None][chore] Remove py_executor from disagg gh team by @pcastonguay in #6716
- [https://nvbugs/5423962][fix] Address broken links by @chenopis in #6531
- [None][fix] Migrate to new cuda binding package name by @tongyuantongyu in #6700
- [https://nvbugs/5410687][fix] Hopper w4a8 groupwise MoE interleave by @symphonylyh in #6708
- [None][feat] Add NCCL Symmetric Integration for All Reduce by @Tabrizian in #4500
- [TRTLLM-6785][feat] BREAKING CHANGE Enable TRTLLM sampler by default by @dcampora in #6216
- [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6719
- [TRTLLM-5252][test] add for mistral_small_3.1_24b perf test by @ruodil in #6685
- [TRTLLM-6744][feat] Remove input_sf swizzle for module WideEPMoE by @StudyingShao in #6231
- [None][fix] Fix unnecessary GPU synchronization in torch sampler caused by incorrect tensor reference by @zhanghaotong in #6626
- [TRTLLM-6854][feat] Enable guided decoding with disagg serving by @syuoni in #6704
- [TRTLLM-5252][fix] Propagate mapping to intermediate layers by @2ez4bz in #6611
- [None][test] fix yml condition error under qa folder by @ruodil in #6734
- [None][doc] Add doc for multimodal feature support matrix by @chang-l in #6619
- [TRTLLM-6898][feat] make fused_moe_cute_dsl work on blackwell by @limin2021 in #6616
- [https://nvbugs/5436461][infra] Adjust free_gpu_memory_fraction of test_eagle3 to prevent OOM on CI by @leslie-fang25 in #6631
- [None][refactor] Combine resmooth_to_fp8_e8m0 and transform_sf_into_required_layout by @yuxianq in #6654
- [https://nvbugs/5437106][fix] Fix llama4 scout TRTLLM attn_backend by @JunyiXu-nv in #6690
- [None][fix] Remove lock related typo in py_executor by @lancelly in #6653
- [None][feat] move kv cache measure into transfer session by @zhengd-nv in #6633
- [None][fix]revert kvcache transfer by @chuangz0 in #6709
- [TRTLLM-6650][fix] Enhance CUDA graph + Beam search to correctly handle padding by @stnie in #6665
- [TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in #6184
- [None][feat] Optimize CUDA graph memory usage for spec decode cases by @mikeiovine in #6718
- [TRTLLM-7025] [infra] Reorganize CODEOWNERS to rectify `examples` mapping by @venkywonka in #6762
- [None][doc] Move AutoDeploy README.md to torch docs by @Fridah-nv in #6528
- [None][fix] WAR GPT OSS on H20 with Triton MOE by @dongfengy in #6721
- [TRTLLM-6420][feat] add support for Eclairv2 model - cherry-pick changes and minor fix by @yibinl-nvidia in #6493
- [None][feat] Core Metrics Implementation by @hcyezhang in #5785
- [https://nvbugs/5398180][feat] Improve Llama4 performance for small max_seqlen cases by @nv-yilinf in #6306
- [TRTLLM-6637][feat] Resolve KV cache divergence issue by @ziyixiong-nv in #6628
- [None][infra] Waive test main 0808 by @EmmaQiaoCh in #6751
- [#5048][enhance] AutoDeploy: Optimize prepare_inputs by @galagam in #6634
- [None][chore] Dead code elimination, we no longer record/fetch through WindowBlockManager:: mContextBlocksByHash by @eopXD in #6249
- [TRTLLM-6174][feat] Enable FP32 mamba ssm cache by @shaharmor98 in #6574
- [https://nvbugs/5444937][fix] Fixing kv_cache_event unit test by @pcastonguay in #6753
- [TRTLLM-6823][doc] Add checkpoint refactor docs by @shaharmor98 in #6592
- [None][feat] Support SharedTensor on MultimodalParams by @yechank-nvidia in #6254
- [None][feat] improve dataloading for benchmark_dataset by using batch… by @zerollzeng in #6548
- [https://nvbugs/5431127][fix] Run test_disaggregated_deepseek_v3_lite_fp8_nixl[DeepSeek-V3-Lite-fp8] only on hopper by @bo-nv in #6736
- [None][fix] fix same pp disagg by @chuangz0 in #6730
- [None][feat] Add gpt-oss GSM8K test. by @Tracin in #6732
- [None][test] Test trtllm-bench AD vs, PT BEs on H100 single gpu by @MrGeva in #6487
- [TRTLLM-5633][infra] Force set changed file diff to empty string for post-merge CI by @yiqingy0 in #6777
- [None][chore] remove closed bugs by @xinhe-nv in #6772
- [None][infra] Waive failed tests on main 0811 by @EmmaQiaoCh in #6778
- fix: Ensure that Python stub generation works against libnvidia-ml stubs by @MartinMarciniszyn in #6188
- [TRTLLM-5532][feat] store the block of context request into kv cache by @byshiue in #6683
- [None][doc] Add K2 tool calling examples by @lancelly in #6667
- [None][infra] Unwaive an updated case to test by @EmmaQiaoCh in #6791
- [None][chore] always try-catch when clear build folder in build_wheel.py by @zhenhuaw-me in #6748
- [TRTLLM-6812][feat] Add standardized GitHub issue templates and disable blank issues by @venkywonka in #6494
- [None][fix] Refactoring to avoid circular import when importing torch models by @rakib-hasan in #6720
- [None][chore] Find LLM_ROOT and LLM_BACKEND_ROOT dynamically by @achartier in #6763
- [https://nvbugs/5385987][fix] Fix Qwen2 quantization issue by pinning transformers version by @ch...
v1.2.0rc5
Announcement Highlights
Vulnerability
- Two security vulnerabilities have been identified in the urllib3 package versions >= 1.24 and < 2.6.0. These issues will be addressed in the next release. For detailed information on the vulnerabilities, refer to the following advisories:
  - GHSA-gm62-xv2j-4w53
  - GHSA-2xpw-w6gg-jr37
- To mitigate the issues immediately, users are advised to upgrade urllib3 to version 2.6.0 or later.
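For pip-managed environments, the immediate mitigation is a one-line upgrade; a minimal sketch, assuming urllib3 was installed via pip in the active environment:

```bash
# Move urllib3 out of the affected range (>= 1.24, < 2.6.0)
pip3 install --upgrade "urllib3>=2.6.0"

# Verify the installed version
python3 -c "import urllib3; print(urllib3.__version__)"
```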
Model Support
- Slimmed down implementation of Nemotron H (#9235)
- Add support for Starcoder2 in the PyTorch backend (#8923)
- Add support for MLA chunked prefill for the DeepSeek V3.2 model (#9376)
- Add AutoDeploy support for Nemotron-Flash (#9504)
- AutoDeploy: Add Llama4 MoE handling (#9556)
- Add support for nano-v3 and super-v3 with PyTorch backend (#9261)
- AutoDeploy: Add support for nano v3 to custom implementation (#9465)
API
Feature
- Add support for KVCache reuse for DeepSeek V3.2 (#9383)
- Support Yarn on QwQ-32B model (#9059)
- Update DeepGEMM to include optimizations for DeepSeek-v3.2 (#9380)
- Cold L2 cache when doing autotune benchmarking (#8779)
- Improve TRTLLM MoE throughput for small hidden size (#9377)
- Add parser to layer-wise benchmarks (#9440)
- Support custom chat template for tool calling (#9297)
- Add draft token tree runtime on CDL (#8586)
- Top-p optimization by removing redundant softmax (#9411)
- Use FlashInfer's top_k_sampling_from_probs (#9457)
- Overlap context chunks in pipeline parallel mode (#9308)
- Improve all-to-all perf for large CP size in Helix (#9494)
- Support more accurate AR calculation (#9323)
- Support custom config of sharding (#9143)
- Integrate helix parallelism (#9342)
- Optimize RocketKV algorithm (#9333)
- Extend cute_dsl_nvfp4_gemm to sm103 (#9543)
- Add chat template kwargs support to longbench-v2 (#9544)
- Add Beam Search to TorchSampler (#8509)
- Unify nvfp4 gemm backend (#8963)
- Use FlashInfer.sampling by default (#9545)
- Add RocketKV usage doc and e2e accuracy test on LongBenchV2 (#9572)
- Alias to comply with LlmArgs (#9586)
- Update trtllm-gen nvfp4 kernels with better performance (#9510)
- Enable CuteDSL MoE with Large EP (#9592)
- Convert cuteDSL GEMM to opt-in feature (#9682)
- Optimize the load_weights method to include mapping parameter (#9583)
- Support torch compile for pipeline parallel Llama and DeepSeekV3 (#7838)
- Check if executor is shutdown in /health entrypoint (#9057)
- Add NIXL-LIBFABRIC support (#9225)
- Decouple disagg service from FastAPI (#8714)
- AutoDeploy: Add NVFP4 Cutlass MoE kernels (#9551)
- AutoDeploy: Draft Target Speculative Decoding (#9275)
- AutoDeploy: Support TRTLLM Sampler (#9641)
- AutoDeploy: Perf optimization for Attention and rmsnorm (#9719)
- AutoDeploy: Use router gemm op for Nemotron MOE (#9500)
- AutoDeploy: Remove redundant copies in mamba layers (#9461)
- AutoDeploy: Add A_log fusion for Mamba layers (#9422)
- AutoDeploy: Update dist ops (#9301)
Fix
- Modify qwen3-next sampling stop_tokens (#9331)
- Fix mismatched nvfp4 gemm sf shape (#9336)
- Enhance warning in cacheTransBuffer (#9390)
- Fix top-k outIndices with vectorized_process (#9404)
- Let KV cache manager block initialization respect dry run (#9093)
- Avoid cudaFree overlap with cuda graph (#9438)
- Fix TP support for DeepSeek-V3.2 on Hopper (#9484)
- Fix Qwen3-235B ATP accuracy issue with PDL (#9530)
- Correct virtual memory allocation alignment (#9491)
- Fix view operation on uncontiguous tensor (#9576)
- Extract GPU count from single-node stage names (#9599)
- Refine Piecewise Cuda Graph condition for DP (#9393)
- Enhance RPC robustness (#8711)
- Fix synchronization bugs in KvCacheTransferManager preventing corrupted blocks (#9056)
- Fix dist-serving performance by clearing CPU affinity (#9549)
- Fix wide ep MoE error (#9642)
- Fix LoRa enablement for GPT OSS Torch (#8253)
- Recover TRTLLM MoE performance for DEP (#9562)
- Fix error when processing batches containing both text and multimodal data (#8381)
- Fix deepseek_fp8_block_scales using 2D x_sf in TRTLLMGEN-MoE (#9658)
- Enable HMAC in RPC (#9745)
- Start disagg workers and servers on free ports (#9694)
- AutoDeploy: fix nano sharding config (#9668)
- AutoDeploy: Remove auto-tuner from nvfp4_gemm forward (#9497)
Documentation
- Fix math formula rendering issues (#9481)
- Qwen3 deployment guide (#9488)
- KV Connector Docs (#9325)
- Deployment Guide for Kimi K2 Thinking on TensorRT LLM - Blackwell (#9711)
- Add feature docs for helix parallelism (#9684)
- Add examples showcasing OpenAI compatible APIs (#9520)
- Update Linux installation guide (#9485)
- Refine the slurm examples (#9548)
- Link to modelopt checkpoints in quick start guide (#9571)
Test & Infra
- Rename AlltoAll backend names (#9329)
- Move build config from BaseLlmArgs to TrtLlmArgs (#9249)
- Reduce nested nvtx ranges (#9347)
- Add disagg and wideep multi-node multi-gpu test cases (#9356)
- Upgrade CuteDSL to 4.3.0 (#9444)
- Use flexcache for gh200 nodes (#9405)
- Evaluate helix parallelism with DSV3 Lite (#9597)
- AutoDeploy update cuda stream manager for multi-device (#9575)
- Add container notices and documentation (#9185)
- Increase warmup times in multi-gpu testing (#9578)
What's Changed
- [#9316][feat] AutoDeploy: Add the accuracy test for Nemotron MOE models by @nvchenghaoz in #9317
- [#9096][feature] Auto Deploy: configurable fused MoE backend by @nzmora-nvidia in #9194
- [None][fix] Use fp32 for indexer weight_proj GEMM by @chang-l in #9243
- [None][fix] Multimodal InputProcessor dummy builder fix by @yechank-nvidia in #8916
- [None][ci] waive test_disagg_server_restart by @QiJune in #9326
- [None][chore] Revise the description of enable_autotuner. by @hyukn in #9320
- [TRTLLM-9295][fix] use greedy decoding in test_openai_compatible_json_schema by @ixlmar in #9305
- [TRTLLM-9164][infra] Enable checking duplicate items in waives.txt in pre-commit by @EmmaQiaoCh in #9265
- [#9236][feature] Make sharing of activation_type across SW layers more robust by @nzmora-nvidia in #9238
- [https://nvbugs/5667687][fix] Set correct lm_head_tp_size_upper_bound by @lancelly in #9300
- [https://nvbugs/5667454][test] Fix Test Case as Chunked Attention not Supported on sm_120 by @yufeiwu-nv in #9260
- [None][chore] Weekly mass integration of release/1.1 by @mikeiovine in #8918
- [None][chore] Upgrade starlette and FastAPI by @tburt-nv in #9319
- [None][infra] Update goggles_action repository by @karljang in #9240
- [TRTLLM-9197][infra] Move thirdparty stuff to it's own listfile by @cheshirekow in #8986
- [TRI-332] [fix] Fix L0_backend_trtllm by @yinggeh in #9282
- [None][ci] waive test_llm_context_only_timed_out_kv_cache_exhausted by @QiJune in #9351
- [None][infra] Add fallback when get wheel from build stage is fail by @ZhanruiSunCh in #9290
- [TRTLLM-9183][infra] Add --waives-file in rerun pytest command by @yiqingy0 in #8971
- [TRTLLM-8957][feat] create communication related classes by @xxi-nv in #8968
- [None][chore] Add periodic junit xml path in conftest by @crazydemo in #9337
- [None][ci] waive a test case of test_ad_build_small_multi.py by @QiJune in #9355
- [None][infra] Waive failed cases in main post-merge on 11/21 by @EmmaQiaoCh in #9360
- [None][chore] Bump version to 1.2.0rc4 by @yiqingy0 in #9363
- [TRTLLM-8650][fix] beam search request validation (#8433) by @ixlmar in #9228
- [TRTLLM-9191][feat] support out-of-tree models in trtllm-serve by @ixlmar in #9269
- [https://nvbugs/5629833][fix] Don't fill tensors by @HuiGao-NV in #9296
- [None][feat] TRT-LLM Gen MoE optimize DeepSeek Fp8 activation kernel by @nekorobov in #9175
- [https://nvbugs/5590408][fix] Fallback to greedy sampling in two-model overlap scheduler by @ziyixiong-nv in #9321
- [TRTLLM-9208][infra] Document the process for C++ deps by @cheshirekow in #9016
- [TRTLLM-9370][feat] Integration of CuteDSL NVFP4 grouped GEMM (Part 2: SwiGLU Fusion and Finalize Fusion) by @syuoni in #9288
- [None][feat] Eagle: PostNorm and multilayer options by @IzzyPutterman in https:...
v1.2.0rc4
Highlights
Model Support
API
- Support out-of-tree models in `trtllm-serve` (#9269)
Feature
Fix
- Use fp32 for indexer `weight_proj` GEMM (#9243)
- Fix multimodal `InputProcessor` dummy builder (#8916)
- Set correct `lm_head_tp_size_upper_bound` (#9300)
- Move `torch.cuda.Stream` out of critical `torch` computation region (#8494)
- Fix `trtllm-llmapi-launch` port conflict (#8582)
- Rework `DisaggPPTerminationHandler` to fix hang issue (#8519)
- Overwrite only if `default_max_tokens` is legal (#8538)
- Fix block range index (#8470)
- Restrict FP8 blockscale MoE case to valid configurations (#8583)
- Fix `L0_backend_trtllm` behavior (#9282)
- Improve beam search request validation (#9228)
- Avoid incorrectly filling tensors with 0 (#9296)
- Fallback to greedy sampling in two-model overlap scheduler to improve stability (#9321)
Documentation
Benchmark
- Set `max_batch_size=1` to stabilize accuracy test results (#8609)
Test & Infra
- Use greedy decoding in `test_openai_compatible_json_schema` (#9305)
- Enable checking duplicate items in `waives.txt` in pre-commit (#9265)
- Fix test case where chunked attention is not supported on `sm_120` (#9260)
- Add `NCCL_DEBUG=INFO` flag to collect more information on CI failures (#8440)
- Remove multimodal test cases using TRT backend (#8611)
- Clean cache for easily hanging test cases (#8619)
- Enable relaxed acceptance test on Blackwell (#8709)
- Update linter rules for mass integration (#8918)
- Upgrade `starlette` and `FastAPI` dependencies (#9319)
- Update `goggles_action` repository (#9240)
- Move third-party components to their own list file (#8986)
- Add fallback when fetching wheel from build stage fails (#9290)
- Add `--waives-file` flag in rerun `pytest` command (#8971)
- Add periodic JUnit XML path in `conftest` (#9337)
- Consume `SlurmCluster` `sshPort` for clusters with custom SSH port (#9313)
- Add one-model and overlap-scheduling to Eagle tests for GPTOSS (#9312)
What's Changed
- [#9316][feat] AutoDeploy: Add the accuracy test for Nemotron MOE models by @nvchenghaoz in #9317
- [#9096][feature] Auto Deploy: configurable fused MoE backend by @nzmora-nvidia in #9194
- [None][fix] Use fp32 for indexer weight_proj GEMM by @chang-l in #9243
- [None][fix] Multimodal InputProcessor dummy builder fix by @yechank-nvidia in #8916
- [None][ci] waive test_disagg_server_restart by @QiJune in #9326
- [None][chore] Revise the description of enable_autotuner. by @hyukn in #9320
- [TRTLLM-9295][fix] use greedy decoding in test_openai_compatible_json_schema by @ixlmar in #9305
- [TRTLLM-9164][infra] Enable checking duplicate items in waives.txt in pre-commit by @EmmaQiaoCh in #9265
- [#9236][feature] Make sharing of activation_type across SW layers more robust by @nzmora-nvidia in #9238
- [https://nvbugs/5667687][fix] Set correct lm_head_tp_size_upper_bound by @lancelly in #9300
- [https://nvbugs/5667454][test] Fix Test Case as Chunked Attention not Supported on sm_120 by @yufeiwu-nv in #9260
- [None][chore] Weekly mass integration of release/1.1 by @mikeiovine in #8918
- [None][chore] Upgrade starlette and FastAPI by @tburt-nv in #9319
- [None][infra] Update goggles_action repository by @karljang in #9240
- [TRTLLM-9197][infra] Move thirdparty stuff to it's own listfile by @cheshirekow in #8986
- [TRI-332] [fix] Fix L0_backend_trtllm by @yinggeh in #9282
- [None][ci] waive test_llm_context_only_timed_out_kv_cache_exhausted by @QiJune in #9351
- [None][infra] Add fallback when get wheel from build stage is fail by @ZhanruiSunCh in #9290
- [TRTLLM-9183][infra] Add --waives-file in rerun pytest command by @yiqingy0 in #8971
- [TRTLLM-8957][feat] create communication related classes by @xxi-nv in #8968
- [None][chore] Add periodic junit xml path in conftest by @crazydemo in #9337
- [None][ci] waive a test case of test_ad_build_small_multi.py by @QiJune in #9355
- [None][infra] Waive failed cases in main post-merge on 11/21 by @EmmaQiaoCh in #9360
- [None][chore] Bump version to 1.2.0rc4 by @yiqingy0 in #9363
- [TRTLLM-8650][fix] beam search request validation (#8433) by @ixlmar in #9228
- [TRTLLM-9191][feat] support out-of-tree models in trtllm-serve by @ixlmar in #9269
- [https://nvbugs/5629833][fix] Don't fill tensors by @HuiGao-NV in #9296
- [None][feat] TRT-LLM Gen MoE optimize DeepSeek Fp8 activation kernel by @nekorobov in #9175
- [https://nvbugs/5590408][fix] Fallback to greedy sampling in two-model overlap scheduler by @ziyixiong-nv in #9321
- [TRTLLM-9208][infra] Document the process for C++ deps by @cheshirekow in #9016
- [TRTLLM-9370][feat] Integration of CuteDSL NVFP4 grouped GEMM (Part 2: SwiGLU Fusion and Finalize Fusion) by @syuoni in #9288
- [None][feat] Eagle: PostNorm and multilayer options by @IzzyPutterman in #9233
- [TRTLLM-9082][feat] AutoDeploy: Move the moe Align kernel to AOT by @nvchenghaoz in #9106
- [#9388][fix] AutoDeploy: Fix cutlass BF16 MoE kernel invocation by @nzmora-nvidia in #9339
- [TRTINFRA-7326][infra] - Consume SlurmCluster sshPort for clusters with custom SSH port by @mlefeb01 in #9313
- [None][test] Add one-model and overlap-scheduling to eagle tests for GPTOSS by @dongfengy in #9312
Full Changelog: v1.2.0rc3...v1.2.0rc4
Release v1.2.0rc3
Announcement Highlights
Model Support
API
- Add `trtllm_` prefix for exposed metrics (#8845)
- Return logprobs incrementally in torch backend (#8785)
- Enable n > 1 in OpenAI API with PyTorch backend (#8951)
- Support json_schema in response_format (#8934)
- Add TRTLLM_NIXL_KVCACHE_BACKEND environment variable for NIXL backend selection (#9075)
- Prevent negative `max_tokens` passed into tllm request (#9037)
Feature
- Fuse QK down_proj with indexer K + weight_proj for FP4 ckpt (#8771)
- Add swapsMmaAb sparseMla kernels (#8913)
- Implement Deep Research with scaffolding (#8452)
- Add rope and uk-bgemm overlap for MLA generation (#8495)
- Add NUMA-aware CPU affinity autoconfig (#8805)
- Add custom indexer k cache scatter op (#8960)
- Allow env variable to specify spawn process IPC address (#8922)
- Implement sampling using FlashInfer.sampling (#8581)
- Enhance the overlap scheduler for two-model spec decoding (#8706)
- Update TRTLLM Cutlass MoE kernels with ReLU2 (#9011)
- Unify MPI & Ray's req/response handling with RPC Client/Server (#8765)
- Use triton kernels for RocketKV prediction module (#8682)
- Support accuracy test and install from wheel (#9038)
- Add tree attention support for blackwell arch (#8975)
- Add simple optimizations for MTP 2-model (#9176)
- Enable early exit with overlap scheduler (#8587)
- Add dynamic draft length in spec decode (stage 1) (#8194)
- Add bias for FP4 TRT-LLM Gen MoE (#9220)
- Integrate CuteDSL NVFP4 grouped GEMM (#8880)
- Add ability to cancel disagg request if KV cache resources are exhausted (#9155)
- Make factory sharding the default (#9144)
- Enable simple sharding for latent experts (#9099)
- Update the indexer topK (#9255)
- Add fp8 dense for sm120 (#9174)
- Add specdec to nemotron nas (#8985)
- Use CUDAGraph to improve the tuning accuracy for AutoTuner (#9089)
- Add ReLU2 to TRTLLM Cutlass MoE BF16 kernels (#9191)
- Add pp_partition to customize each rank's layer number (#9003)
- Enable EPLB for trtllm-gen and cutlass backend (#8886)
- Add optimized trtllm-gen attention kernels on sm103 (#9081)
- Add MTP>1 support for DS-v3.2 (#9045)
Benchmark
- Add Qwen3-Next to layer-wise benchmarks (#9065)
- Refactor benchmark infrastructure (#9207)
- Print device info in trtllm-bench report (#8584)
- Use torch.compile to fuse copy + layernorm within the LayerNorm module (#9052)
- Add torch.compile + multi-stream support for k-cache scatter and weight scaling (#8988)
- Adjust select_alltoall_method_type (#8950)
Documentation
- Replace the relative links with absolute links in README.md (#8995)
- Update llama and llama4 example doc (#9048)
- Update doc/tests/chat_template for nano-v2-vlm (#8840)
- Add Mixed Precision Context and Generation section to Disagg (#8769)
- Add DeepSeek-V3.2-Exp document (#9141)
- Update docs for EPLB (#9166)
- Update the Flux autodeploy example (#8434)
- Update DS-R1 example doc (#9231)
- Update license (#8807)
Fix & Infra
- Fix the logger once key issue and further compress log in AutoTuner (#8873)
- Fix disagg GPT-OSS test (#8870)
- Remove PyTorchConfig completely (#8856)
- Fix boost issue (#8996)
- Lock onnx version <1.20.0 and remove WAR for TRT 10.13 (#9006)
- Fix eagle3 accuracy issue on sm120 (#8944)
- Add customized topk and related unit tests for DSA (#8882)
- Improve type annotations on ResourceManager.get_resource_manager (#9013)
- Add sm103 to CutlassFP8RowwiseGemm (#9042)
- Add context manager to fix FakeTensorProp (#9047)
- Initialize HF modules in worker_main for models with trust_remote=true (#8931)
- Use async `send_requests_to_next_pp` (#9041)
- Display the GPU memory information in GiB units (#9070)
- Add unit tests for TorchSampler batched sampling (#9012)
- Remove circular dependency between model engine and cuda graph runner (#7572)
- Fix precision issue due to KV layout mismatch for split/concat kernels (#6917)
- Clear indexer k cache reference before releasing CUDA memory (#9110)
- Disable UCC as WAR to MPI allgather issue before NGC PyTorch 25.12 upgrade (#9126)
- Fix KV cache manager test warnings (#9103)
- Fix the aux_stream in Llama4MinLatencyFusedMoE (#9035)
- Avoid `torch.compile` being applied multiple times (#9135)
- Upgrade tritonserver DLFW 25.10 (#8929)
- Make the sliced nvfp4 output contiguous (#9123)
- Update the attention layers counting for Qwen3-next (#9072)
- Fix the rank to access `all_rank_chunk_size_list` when chunked MoE is used (#8723)
- Fix missing `ActivationType` issue (#9171)
- Support enroot/pyxis clusters in multi-node SLURM and enable oci-hsg GB200 in post-merge (#9117)
- Fix lock file generation script (#9180)
- Fix a deepseekv3 error when debug mode is on (#9217)
- Fix DeepSeek V3.2 indexer RoPE (#9232)
- Exclude number of draft tokens from `mMaxSeqLenKv` (#9210)
- Upgrade NIXL to 0.7.1 (#9055)
- Fix EPLB for DeepSeek-V3.2-Exp (#9245)
- Log the LLM args for main branch (#9120, #9205)
- Update TRTLLM MoE cubins, reduce mxfp4 weight padding requirement, and tighten TMA bound (#9025)
- Upgrade precommit-hooks to v6.0.0 (#9097)
What's Changed
- [https://nvbugs/5623960][fix] Fix the logger once key issue and further compress log in AutoTuner. by @hyukn in #8873
- [None][infra] update github token name by @niukuo in #8907
- [https://nvbugs/5624367][fix] Fix disagg GPT-OSS test by @chuangz0 in #8870
- [https://nvbugs/5630345][chore] unwaive DS-v32 nvfp4 and fp8 tests by @lfr-0531 in #8887
- [TRTLLM-7251][test] Get submit eplb slots empty key work by @fredricz-20070104 in #8945
- [TRTLLM-8768][chore] Fuse QK down_proj with indexer K + weight_proj for FP4 ckpt by @chang-l in #8771
- [None][feat] add swapsMmaAb sparseMla kernels by @PerkzZheng in #8913
- [TRTLLM-8201][feat] Nemotron H MoE Sharding by @lucaslie in #8744
- [#8924][fix] Fix AutoDeploy pattern matcher for torch 2.9 by @Fridah-nv in #8920
- [https://nvbugs/5606166][fix] AutoDeploy: unwaive test for use tuples for cudagraph shape lookup by @lucaslie in #8957
- [None][feat] Deep Research Implemented with Scaffolding by @Boreas618 in #8452
- [None][infra] allow to choose repo when generate lock files by @yuanjingx87 in #8659
- [None][feat] add waive by sm version by @xinhe-nv in #8928
- [None][feat] Add `trtllm_` prefix for exposed metrics by @nv-yilinf in #8845
- [TRTLLM-8803][feat] Add rope and uk-bgemm overlap for mla generation by @yunruis in #8495
- [https://nvbugs/5630345] [chore] skip deepseek-v3.2 fp8 kv tests on pre-Blackwell architectures by @lfr-0531 in #8973
- [None][chore] Use cached model in all ray tests by @shuyixiong in #8962
- [https://nvbugs/5498478][fix] Fix eagle3 fp8 kv target model + bf16 draft model + chunked prefill by @DylanChen-NV in #8910
- [TRTLLM-8814][feat] AutoDeploy: Use TRTLLM kernels for FP8 linear by @nvchenghaoz in #8820
- [https://nvbugs/5527655][feat] Add NUMA-aware CPU affinity autoconfig by @dhansen-nvidia in #8805
- [None][feat] AutoDeploy: Support Latent MOE for Nemotron by @nvchenghaoz in #8955
- [None][fix] Fix KV cache clearing with KV Connector API by @jthomson04 in #8750
- [https://nvbugs/5637012][fix] Bugfix when config is None for MLA by @chang-l in #8978
- [https://nvbugs/5606136][ci] Remove tests for deprecating models. by @SimengLiu-nv in #8926
- [None][feat] Return logprobs incrementally in torch backend by @dcaox in #8785
- [https://nvbugs/5636986][fix] Fix DeepGemmMoe get_buffer calls by @VALLIS-NERIA in #8939
- [None][fix] Switch AD AllReduce strategy to NCCL by @MrGeva in #8979
- [https://nvbugs/5633340][fix] kill processes properly after test by @reasonsolo in #8970
- [TRTLLM-9065][chore] remove PyTorchConfig completely by @QiJune in #8856
- [https://nvbugs/5508536][fix] Take Over (#8627): Reintroduce: Move stop_criteria to sample_async (#7041) by @stnie in #8794
- [None][fix] type annotations in fuse_input_embeds by @ixlmar in #8976
- [None][fix] add missing CLI option in multimodal example by @ixlmar in #8977
- [None][chore] Bump version to 1.2.0rc3 by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pu...
v1.2.0rc2
Announcement Highlights
Model Support
- Optimize the routing kernel for DeepSeek V3; add MoE TRTLLM backend support for KimiK2 and Qwen-next (#7761)
- Support DeepSeek V3.2 with FP8/BF16 KV cache and NVFP4/BF16 KV cache (#8405)
- Add EVS support for nano-v2-vlm (#8024)
- Support Qwen3 reasoning and tool parsers (#8000, #8216)
- Add Nemotron MOE support in AutoDeploy, including FP8 MOE (#8469, #8737, #8599)
API
Feature
- Add cuBLASLt NVFP4 GEMM backend (#7943)
- Add FP8 rowwise GEMMs for B200 (#8332)
- Enable low-precision alltoall for CUTLASS/TRTLLMGen (#8675)
- Integrate MNNVL Throughput and refactor allreduce kernel for TRTLLM MoE (#8728, #8018)
- Enable RMS norm fusion for Nemotron MOE (#8563)
- Add base64 video input support (#8458)
Fix & Infra
- Upgrade to DLFW 25.10, PyTorch 2.9.0, and Triton 3.5.0 (#8838)
- Fix FP8 blockwise GEMM performance with attention DP (#8501)
- Fix pipeline-parallel bubbles (#8687)
- Cache the AllReduce wrapper to avoid re-allocation hangs (#8803)
- Stabilize tests/CI with waives and slurm/CI updates (#8524, #8573, #8749, #8775, #8808, #8896, #8897)
Benchmark
Documentation
Known issue
- For this pre-release version, install using the specific version identifier: `pip3 install tensorrt-llm==1.2.0rc2`. Installing with `pip3 install tensorrt-llm --pre` will result in a broken dependency on `onnx==1.20.0rc1`. This issue will be resolved in the next release.
What's Changed
- [None][chore] update test duration by @xinhe-nv in #8377
- [None][fix] Avoid overwrite of `kv_cache_config.max_tokens` for VSWA scheme for the KVCacheManager by @eopXD in #8219
- [TRTLLM-8637][feat] Optimize the routing kernel for DeepseekV3 (MoE CUTLASS backend); Add support for 384 experts (MoE TRTLLM backend) by @ChristinaZ in #7761
- [https://nvbugs/5542862][fix] Upgrade fmha_v2. by @yuxianq in #8364
- [TRTLLM-8669][infra] Use artifactory mirror for install python by @ZhanruiSunCh in #8394
- [TRTLLM-7255][feat] Add iteration log parser script for benchmark log by @yizhang-nv in #6942
- [None][ci] move some test cases from H100 to A10 by @QiJune in #8449
- [TRTLLM-8436][feat] batched sampling and top-k logprobs improvements by @ixlmar in #8398
- [None][feat] Update devcontainer configuration to include additional extensions by @Funatiq in #8369
- [https://nvbugs/5540752][fix] Support quantized Phi4 MM models by @pamelap-nvidia in #8190
- [https://nvbugs/5492250][fix] Remove isolated cases and unwaive cases by @HuiGao-NV in #8492
- [TRTLLM-6055][infra] Slurm Test refactor by @yuanjingx87 in #7176
- [https://nvbugs/5568676][fix] Remove test waive by @dongfengy in #8437
- [#8461][feat] AutoDeploy: trtllm-serve bug fix + unit test by @lucaslie in #8462
- [None] [chore] Add architecture-specific ATTRIBUTIONS files by @venkywonka in #8468
- [#8272][feat] Enable chunked prefill for SSMs in AutoDeploy by @suyoggupta in #8477
- [None][feat] Update 3rdparty/DeepGEMM to latest commit by @ruoqianguo in #8488
- [None][feat] Support kv_cahce_reuse for HyperCLOVAX-Vision model by @yechank-nvidia in #7789
- [TRTLLM-8436][fix] restore list[list[list[int]]] in add_token by @ixlmar in #8502
- [None][chore] Move submit.sh to python and use yaml configuration by @zerollzeng in #8003
- [TRTLLM-7287][test] add multimodal chunked_prefill cases by @ruodil in #8011
- [None][feat] Add alltoall to trtllm-gen MoE backend. by @bobboli in #8481
- [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #8486
- [None][ci] rebalance H100 stages by @QiJune in #8491
- [None][feat] Support Qwen3 reasoning parser by @LinPoly in #8000
- [None][infra] Add split algorithm for slurm by @EmmaQiaoCh in #8516
- [TRTLLM-8638][fix] Remove closed bugs by @xinhe-nv in #8478
- [None][chore] Update feature combination matrix for SWA kv cache reuse by @eopXD in #8529
- [None][fix] the api_stability unify default values of None and inspect._empty by @Superjomn in #8496
- [None][infra] Waive failed tests for main 10/21 by @EmmaQiaoCh in #8524
- [None][doc] Facilitates the integration of the transfer agent by @Shixiaowei02 in #7867
- [TRTLLM-8160][feat] Add max_total_draft_tokens by @yweng0828 in #8366
- [None][chore] AutoDeploy: replace HF's deprecated keyword torch_dtype --> dtype by @lucaslie in #8510
- [TRTLLM-7843][feat] implement disagg cluster auto-scaling by @reasonsolo in #8215
- [None][feat] AutoDeploy: Add Nemotron MOE support for AutoDeploy by @nvchenghaoz in #8469
- [TRTLLM-8483][chore] Refine scheduler_config and peft_cache_config in create_py_executor by @leslie-fang25 in #8451
- [https://nvbugs/5556020][fix] test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_eagle3 dimension mismatch by @sunnyqgg in #8517
- [None][doc] Fix the incorrect doc figure by @Shixiaowei02 in #8536
- [TRTLLM-8260][feat] Add Server-Client Perf Test in pytest for B200 and B300 by @chenfeiz0326 in #7985
- [None][infra] Let CI continue running other isolation tests when an isolation test get hanging by @EmmaQiaoCh in #8471
- [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #8554
- [None][feat] Add vLLM KV Pool support for XQA mla kernel by @qsang-nv in #8560
- [https://nvbugs/5451272][fix] unwaive the test by @Shixiaowei02 in #8537
- [None][chore] Bump version to 1.2.0rc2 by @yiqingy0 in #8562
- [None][doc] Paragraph adjustment and fix statistic by @yunruis in #8568
- [None][infra] Waive failed cases for main branch 10/22 by @EmmaQiaoCh in #8573
- [TRTLLM-8785][fix] fix conflicts between periodic-junit and store-durations by @crazydemo in #8518
- [https://nvbugs/5594753][fix] fix rpc unique addr related issue by @Superjomn in #8419
- [#8391][fix] check perf by device subtype by @MrGeva in #8428
- [None][chore] replace print_colored_debug with logger_debug by @Superjomn in #8417
- [None][fix] generate nanobind stubs for submodules by @ixlmar in #8539
- [None][fix] fixed cached model path in test by @MrGeva in #8549
- [None][chore] add precommit hook to remove redundant tab and white space by @xinhe-nv in #8534
- [https://nvbugs/5429636][feat] Kv transfer timeout by @pcastonguay in #8459
- [None][fix] Fix EPLB CPU thread NUMA binding by @dongxuy04 in #8579
- [None][chore] Skip failing import of mxfp4_moe by @brb-nv in #8591
- [TRTLLM-8754][chore] Refine PyTorchModelEngine with llm args by @leslie-fang25 in #8493
- [TRTLLM-8682][chore] Remove auto_parallel module by @anish-shanbhag in #8329
- [None][feat] Update TRTLLM MoE MxFP4 cubins; autotune tileN...