Releases · NVIDIA/Megatron-LM

@svcnvidia-nemo-ci

Changelog Details

beep boop 🤖: Bumping versions by @svcnvidia-nemo-ci :: PR: #4349
cp: NVFP4 native weights for DDP (4005) into core_r0.17.0 by @ko3n1g :: PR: #4290
docs: bump project.json and versions1.json to 0.17.0 by @ko3n1g :: PR: #4361
[docs] ci: fix version picker in 0.17.0 docs by @ko3n1g :: PR: #4363
[docs] ci: use parent-relative json_url for version picker by @ko3n1g :: PR: #4366
Backport NVRx async checkpoint compatibility to core_r0.17.0 by @sbak5 :: PR: #4453
cp: add permute fusion into hybrid ep (4089) into core_r0.17.0 by @ko3n1g :: PR: #4488
cp: get rid of weights_only=False (4434) into core_r0.17.0 by @ko3n1g :: PR: #4554
cp: SafeUnpickler class for safe pickle usage (4319) into core_r0.17.0 by @ko3n1g :: PR: #4555
cp: checkpoint integrity verification (4305) into core_r0.17.0 by @ko3n1g :: PR: #4556
fix(async_ckpt): import inspect in async_utils on core_r0.17.0 by @ko3n1g :: PR: #4597
chore(beep boop 🤖): Bump uv.lock (core_r0.17.0) (2026-05-04) by @svcnvidia-nemo-ci :: PR: #4598
cp: fix: Replace polynomial rolling hash with SHA-256 for prefix caching (#4158) by @chtruong814 :: PR: #4612
build: relax transformers cap to <=5.3.0 on core_r0.17.0 by @ko3n1g :: PR: #4701
chore: Bump TE to latest 2.14 by @chtruong814 :: PR: #4772
cp: additional tests for nvrx (#4522) by @chtruong814 :: PR: #4826
Release 0.17.0 by @ko3n1g
Bump mfsdp to 0.4.0 by @ko3n1g
cp: NVFP4 native weights for DDP (4005) into core_r0.17.0 (#4290) by @ko3n1g
docs: bump project.json and versions1.json to 0.17.0 (#4361) by @ko3n1g
[docs] ci: fix version picker in 0.17.0 docs (#4363) by @ko3n1g
[docs] ci: use parent-relative json_url for version picker (#4366) by @ko3n1g
chore(beep boop 🤖): Bump (core_r0.17.0) (2026-04-20) by @github-actions[bot]
Backport NVRx async checkpoint compatibility to core_r0.17.0 (#4453) by @sbak5
add permute fusion into hybrid ep (#4089) by @Autumn1998
Merge pull request #4488 from NVIDIA/cherry-pick-4089-core_r0.17.0 by @ko3n1g
get rid of weights_only=False (#4434) by @dimapihtar
SafeUnpickler class for safe pickle usage (#4319) by @dimapihtar
checkpoint integrity verification (#4305) by @dimapihtar
Merge pull request #4554 from NVIDIA/cherry-pick-4434-core_r0.17.0 by @ko3n1g
Merge pull request #4555 from NVIDIA/cherry-pick-4319-core_r0.17.0 by @ko3n1g
Merge pull request #4556 from NVIDIA/cherry-pick-4305-core_r0.17.0 by @ko3n1g
fix(async_ckpt): import inspect in async_utils on core_r0.17.0 (#4597) by @ko3n1g
chore(beep boop 🤖): Bump uv.lock (core_r0.17.0) (2026-05-04) (#4598) by @svcnvidia-nemo-ci
cp: fix: Replace polynomial rolling hash with SHA-256 for prefix caching (#4158) (#4612) by @chtruong814
build: relax transformers cap to <=5.3.0 on core_r0.17.0 (#4701) by @ko3n1g
chore(beep boop 🤖): Bump (core_r0.17.0) (2026-05-11) by @github-actions[bot]
chore: Bump TE to latest 2.14 (#4772) by @chtruong814
cp: additional tests for nvrx (#4522) (#4826) by @chtruong814
chore(beep boop 🤖): Bump (core_r0.17.0) (2026-05-18) by @github-actions[bot]
chore(beep boop 🤖): Bump (core_r0.17.0) (2026-05-25) by @github-actions[bot]
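
The entry "Replace polynomial rolling hash with SHA-256 for prefix caching" swaps a non-cryptographic hash for a cryptographic one. A polynomial rolling hash with a fixed modulus can collide, and in a prefix cache a collision would map one request onto another request's cached blocks. The sketch below is not Megatron-LM's implementation. It is a minimal illustration, assuming the cache keys fixed-size token blocks by a digest chained over all preceding blocks. `BLOCK_SIZE`, `block_hashes`, and `shared_prefix_blocks` are hypothetical names.

```python
import hashlib
import struct
from typing import List, Sequence

BLOCK_SIZE = 4  # hypothetical block size in tokens, chosen for illustration


def block_hashes(token_ids: Sequence[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Return one SHA-256 digest per full block, chained over the prefix.

    Each digest covers the previous block's digest plus the current block's
    tokens, so two sequences share a digest only if their entire prefixes
    up to that block are identical (barring a SHA-256 collision).
    """
    digests: List[str] = []
    parent = b""  # digest of the preceding block; empty for the first block
    full_blocks = len(token_ids) // block_size
    for i in range(full_blocks):
        block = token_ids[i * block_size:(i + 1) * block_size]
        hasher = hashlib.sha256()
        hasher.update(parent)
        # Fixed-width little-endian encoding avoids ambiguity between
        # token sequences such as [1, 23] and [12, 3].
        hasher.update(struct.pack(f"<{len(block)}q", *block))
        parent = hasher.digest()
        digests.append(parent.hex())
    return digests


def shared_prefix_blocks(a: Sequence[int], b: Sequence[int]) -> int:
    """Count how many leading blocks of two requests have equal digests."""
    count = 0
    for hash_a, hash_b in zip(block_hashes(a), block_hashes(b)):
        if hash_a != hash_b:
            break
        count += 1
    return count


# Two requests that agree on their first eight tokens (two full blocks).
request_a = [5, 9, 2, 7, 1, 1, 3, 8, 4, 4]
request_b = [5, 9, 2, 7, 1, 1, 3, 8, 6, 0, 2, 2]
print(shared_prefix_blocks(request_a, request_b))  # prints 2
```

Chaining the parent digest into each block means a lookup on block N implicitly checks blocks 0 through N-1, which is the property a prefix cache needs before it reuses cached state.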

@deepakn94

Changelog Details

Fix two minor bugs in MTP implementation for hybrid models by @deepakn94 :: PR: #3194
Update README.md by @mvirts :: PR: #2111
mRoPE for MTP by @BestJuly :: PR: #3114
Fix bug in SFTDataset by @duncanriach :: PR: #3185
Fix several syntax error by @HollowMan6 :: PR: #3004
Fix for RL Test by @wdykas :: PR: #3148
Fix latent moe flops and backward_dw by @buptzyb :: PR: #2977
Use global user buffer when the bucket size does not fit FixedPoolAllocator by @shengf-nv :: PR: #2857
ci: Checkpoint retention by @ko3n1g :: PR: #3205
Add unit test for LatentMoE by @venmugil :: PR: #2892
ci: Enable unit tests on merge-queue by @ko3n1g :: PR: #3186
Fix seq pack flag in get_logprobs by @mathemakitten :: PR: #3206
ci(fix): Parse unit tests in merge-queue by @ko3n1g :: PR: #3224
Fix TE 2.12 AllGather CI failure by @BestJuly :: PR: #3101
ci(hotfix): Pin uv by @ko3n1g :: PR: #3233
Add a unit test to check that RL get_logprobs will reuse training cudagraphed forward pass by @mathemakitten :: PR: #3209
Do not offload grad buffers when training graphs are enabled by @mathemakitten :: PR: #3231
Fix missing PackedSeqParams import by @parthmannan :: PR: #3214
Synchronize the request counts for EP inference with strict matching by @santhnm2 :: PR: #3033
Fix coordinator address collision check in flask by @tdene :: PR: #3208
Do not let requests fail silently inside inference engine by @tdene :: PR: #3228
torch saver inference model offload by @wdykas :: PR: #3170
enable cuda graph ut by @Autumn1998 :: PR: #3197
Support EP with HSDP by @wplf :: PR: #2840
[Main] Add the missing part to support 1F1B overlap for Qwen3-Next by @BestJuly :: PR: #2997
Missing import fix by @parthmannan :: PR: #3241
Miscellaneous inference cleanup (Replay of !2955) by @santhnm2 :: PR: #3232
Add DistributedInitConfig by @maanug-nv :: PR: #3173
Fix checkpoint converter missing parallel group initialization by @yashaswikarnati :: PR: #3217
Skip empty sequences and chunks in MTP tensor roll by @BestJuly :: PR: #3035
Implement get_parameters for ChainedOptimizer by @nschank :: PR: #3201
ci(fix): Create main/dev image tags by @ko3n1g :: PR: #3252
Reapply "Add MTP support for hybrid models (#2363)" by @sancha :: PR: #3207
Fix uv install for GH actions by @Phlip79 :: PR: #3259
Update the project structure in README by @janEbert :: PR: #3251
Cherry-pick: Fix mtp_num_layers and clip_qk issues (#2581, #2776) by @BestJuly :: PR: #3075
RL: training cudagraphs functional test by @mathemakitten :: PR: #3235
[Main] fix cg missing wgrad hook by @Wohox :: PR: #3074
Avoid .cuda call on meta device in LanguageModel by @nschank :: PR: #3202
fix checkpointing error message by @dimapihtar :: PR: #3203
Nano QAT/D fix with sft tokenizer and datasets by @ChenhanYu :: PR: #3254
Revert "fix checkpointing error message (#3203)" by @ko3n1g :: PR: #3283
Reapply "fix checkpointing error message (#3203)" (#3283) by @ko3n1g :: PR: #3285
docs: Add changelog for 0.15.3 by @ko3n1g :: PR: #3286
ci: Set throughput tests as flaky by @chtruong814 :: PR: #3301
chore: Move GB200 tests to nightly by @ko3n1g :: PR: #3302
Ensure type-checker understands use of Submodules in bert_model by @nschank :: PR: #3256
Override extra_repr instead of repr by @nschank :: PR: #3200
Replace ModuleSpec with Protocols for LayerNorm submodules by @nschank :: PR: #3090
Non colocated refit by @wdykas :: PR: #3213
Fuse permute+pad and unpermute+unpad ops for FP8/FP4 training by @xiaoxi-wangfj :: PR: #2763
Add check to prevent MFSDP from numeric issue in gradient accumulate fusion by @shjwudp :: PR: #2904
update get_embedding_ranks and get_position_embedding_ranks docstrings by @c1lovez1 :: PR: #3223
Param offset in _ParamAndGradBucket should be aligned by @skydoorkai :: PR: #3007
ci: Add secrets detector by @chtruong814 :: PR: #3180
Ensure type-checker understands use of Submodules in llava_model by @nschank :: PR: #3257
updates to support modelopt EAGLE training with CP by @yeyu-nvidia :: PR: #3147
fully remove legacy tokenizer system by @dimapihtar :: PR: #2946
M-FSDP: Remove redundant stream waits in HSDP to prevent CG fail by @shjwudp :: PR: #2941
General README and pyproject fixes by @ahmadki :: PR: #2907
chore: More aggressive checkpointing by @ko3n1g :: PR: #3315
ci: Pin down setuptools to lt 82 by @ko3n1g :: PR: #3313
fix: numpy overflow by @ko3n1g :: PR: #3306
fix: T5 dataset by @ko3n1g :: PR: #3307
ci: Revert "ci: Add secrets detector (#3180)" by @chtruong814 :: PR: #3330
ci: Add more tests, run on merge-queue by @ko3n1g :: PR: #3317
ci: Remove merge-gate environment check by @chtruong814 :: PR: #3331
Use FP4 context for mamba by @kwyss-nvidia :: PR: #2604
ci: Ensure we run all functional tests in merge group by @chtruong814 :: PR: #3332
Replace ModuleSpec with Protocols for inputs to MLP by @nschank :: PR: #3084
ci: Fix merge queue functional tests by @chtruong814 :: PR: #3337
ci: skip queue in merge-gate by @ko3n1g :: PR: #3343
ci: Timeout for functional tests by @ko3n1g :: PR: #3346
update checkpointing documentation by @dimapihtar :: PR: #3347
Update golden values to reflect improvements by @tdene :: PR: #3350
BUGFIX: gpt vs hybrid model mtp naming mismatch by @sancha :: PR: #3334
Disable flaky test by @tdene :: PR: #3354
re-enable gpt grpo tests by @jon-barker :: PR: #3348
Fix SFT Pipeline when TP>1 by @asolergi-nv :: PR: #3268
Fixes for KD mode by @AAnoosheh :: PR: #3342
chore: Update codeowners file by @ko3n1g :: PR: #3365
Siddharth/fix inference functional tests by @sidsingh-nvidia :: PR: #3357
Switch oncall by @janEbert :: PR: #3360
Add missing RMSNorm to llama train script by @AAnoosheh :: PR: #3314
Fix inference for MTP models by @tdene :: PR: #3297
Add a logprobs test with real gpt model. by @yobibyte :: PR: #2870
Add simple GRPO functional test by @tdene :: PR: #3323
ci: Concurrency control for merge-queue by @ko3n1g :: PR: #3353
ci: Update golden value download script to work with Github by @chtruong814 :: PR: #3335
fix: correct typos 'seperated' and 'recieved' by @thecaptain789 :: PR: #3305
Improved PyTorch profiler and added PyTorch execution trace by @shengf-nv :: PR: #3273
Removing etc from main index page, shifted name of discussions by @megnvidia :: PR: #3271
build: Bump TE on 2.12 by @ko3n1g :: PR: #3371
ci(hotfix): job conditions by @ko3n1g :: PR: #3376
Record moe routing decisions during inference. by @sidsingh-nvidia :: PR: #3034
[Main] Fix EP Overlap Bugs for Full-Iter CG by @Wohox :: PR: #3164
Avoid direct pickle import by @maanug-nv :: PR: #3375
Delete old pretrain_* files by @Phlip79 :: PR: #3359
Add Qwen3-VL support with Megatron-FSDP by @xuwchen :: PR: #2841
Refactor Mamba chunked prefill by @santhnm2 :: PR: #3265
Improved parallel logging of learning rate by @jstjohn :: PR: #3319
Add enhanced event tracking with TTFT measurement and compact serialization. by @lmcafee-nvidia :: PR: #3253
Add assertion that max_requests is divisible by tp_size by @santhnm2 :: PR: #3304
Move to using the Inference OpenAI API server by @ArEsKay3 :: PR: #3107
Update moe github test cases. by @Victarry :: PR: #3077
Split layer_specs to return Submodules instead of ModuleSpecs by @nschank :: PR: #3255
ci: Remove gpu sanity check by @chtruong814 :: PR: #3420
[Critical-Bug] Fix Uneven PP for Mamba models (Nemotron3-nano) by @kevalmorabia97 :: PR: #3399
Fix for rl by @shanmugamr1992 :: PR: #3390
Add check for full_iteration scope before instantiating CudaGraphManager by @vasunvidia :: PR: #3362
Fix broken links throughout by @megnvidia :: PR: #3230
Decouple topk and loss from DSA Indexer by @kunlunl :: PR: #3248
Extract intermediate embeddings of transformer block by @sajadn :: PR: #3060
Move to using the Inference OpenAI API server (bis) by @tdene :: PR: #3395
Make Mamba inference state memory ratio configurable by @santhnm2 :: PR: #3322
Fix configs for RL model environments by @tdene :: PR: #3441
Replace pickle with json in rl_utils by @tdene :: PR: #3351
fix: correct typo in demo training example by @dndnda :: PR: #3428
Clean up logging inside inference flask server by @tdene :: PR: #3437
ci: Update release-docs workflow to use FW-CI-templates v0.72.0 by @chtruong814 :: PR: #3438
Fix --tokenizer-hf-include-special-tokens by @jon-barker :: PR: #3422
Update num_tokens_to_generate default for Gym by @tdene :: PR: #3453
Fix slowdown in inference flask server by @tdene :: PR: #3445
Add a normalized scale for MTP per token loss by @BestJuly :: PR: #3159
[Bugfix] Fix nan loss caused by zero token in MTP by @BestJuly :: PR: #3396
Log RL metrics per environment by @yobibyte :: PR: #3446
Move tensor offload/onload out of RL code by @tdene :: PR: #3029
Fix another inference flask / Gym interaction by @tdene :: PR: #3467
Add Engine event to the follow up requests after checkpointing by @ArEsKay3 :: PR: #3473
adding in copyright blurb at the top of md file by @megnvidia :: PR: #3394
[Megatron-FSDP] Add fsdp_all_gather_in_start_param_sync option in DDP Config by @shjwudp :: PR: #3095
ci: Update release workflow to include changelog and publish docs by @chtruong814 :: PR: #3472
ci(fix): Weekly GPT tests by @ko3n1g :: PR: #3443
ci: Remove environments by @ko3n1g :: PR: #3462
update HF tokenizer defaults by @dimapihtar :: PR: #3440
ci: Bump preflight to detect our svc by @ko3n1g :: PR: #3494
build: Drop Python 3.10 support and pip install one-logger by @ko3n1g :: PR: #3485
PTQ changes for upcoming QAD by @AAnoosheh :: PR: #3124
ci: Bump pre-flight for Bot SSO by @ko3n1g :: PR: #3497
Revert "build: Drop Python 3.10 support and pip install one-logger (#...
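
Two entries in the list above ("Avoid direct pickle import" and "Replace pickle with json in rl_utils") move away from pickle. The changelog does not describe how, so the following is only a sketch of the general idea, assuming the payload is plain data. `RolloutRecord`, `dumps_record`, and `loads_record` are hypothetical names and are not taken from `rl_utils`.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class RolloutRecord:
    """Hypothetical plain-data record; not a type from Megatron-LM."""
    prompt: str
    token_ids: List[int]
    reward: float


def dumps_record(record: RolloutRecord) -> bytes:
    """Serialize the record to UTF-8 JSON instead of pickle."""
    return json.dumps(asdict(record)).encode("utf-8")


def loads_record(payload: bytes) -> RolloutRecord:
    """Parse JSON and rebuild the record, checking field types explicitly.

    json.loads only produces dicts, lists, strings, numbers, booleans and
    None, so parsing untrusted input does not construct arbitrary objects
    the way pickle.loads can.
    """
    data = json.loads(payload.decode("utf-8"))
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    prompt = data["prompt"]
    token_ids = data["token_ids"]
    reward = data["reward"]
    if not isinstance(prompt, str):
        raise ValueError("prompt must be a string")
    if not (isinstance(token_ids, list) and all(isinstance(t, int) for t in token_ids)):
        raise ValueError("token_ids must be a list of integers")
    if not isinstance(reward, (int, float)):
        raise ValueError("reward must be a number")
    return RolloutRecord(prompt=prompt, token_ids=token_ids, reward=float(reward))


original = RolloutRecord(prompt="2 + 2 =", token_ids=[17, 42, 17, 9], reward=1.0)
restored = loads_record(dumps_record(original))
print(restored == original)  # prints True
```

The trade-off is that JSON cannot carry arbitrary Python objects, so every type that crosses the boundary needs an explicit encoding like the one above.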

@ko3n1g

Changelog Details

cp: ci: Skip cleanup-taint-node jobs during deployments (3612) into core_r0.16.0 by @ko3n1g :: PR: #3613
beep boop 🤖: Bumping versions by @svcnvidia-nemo-ci :: PR: #3616
cp: docs: Fix version picker urls (3621) into core_r0.16.0 by @ko3n1g :: PR: #3622
cp: ci: Increase changelog generation max PRs fetched (3620) into core_r0.16.0 by @ko3n1g :: PR: #3623
Cherry-pick #3399 for Mamba Uneven PP fix by @kevalmorabia97 :: PR: #3544
cp: fix: async_utils: explicit GC in persistent checkpoint worker loop (3591) into core_r0.16.0 by @ko3n1g :: PR: #3628

@ko3n1g

Changelog Details

ci: Fix copyright checker by @ko3n1g :: PR: #1893
chore: Add codeowners by @ko3n1g :: PR: #1897
ci: Extend queue-manager for dev branch by @ko3n1g :: PR: #1906
ci: Move test optimizer into its own bucket by @ko3n1g :: PR: #1909
ci: Configure cherrypick bot by @ko3n1g :: PR: #1925
Ci approve dev by @ko3n1g :: PR: #1933
ci: Update nightly schedule by @ko3n1g :: PR: #1934
ci: Bump pre-flight for runs on main/dev by @ko3n1g :: PR: #1935
ci: Allow skipping on main by @ko3n1g :: PR: #1936
Ko3n1g/ci/pr template community bot by @ko3n1g :: PR: #1937
ci: More granular unit tests buckets by @ko3n1g :: PR: #1932
Add sequence packing to RL by @tdene :: PR: #1911
chore: Update template by @ko3n1g :: PR: #1939
chore: Add description about who can merge by @ko3n1g :: PR: #1940
Ko3n1g/ci/fix main on eos by @ko3n1g :: PR: #1938
Ko3n1g/ci/internal mrs by @ko3n1g :: PR: #1942
ci: Fix branch of approval bot by @ko3n1g :: PR: #1944
ci: Approvalbot for other branches by @ko3n1g :: PR: #1947
ci(fix): Approval bot by @ko3n1g :: PR: #1949
Ko3n1g/ci/sync branches by @ko3n1g :: PR: #1956
Ko3n1g/ci/add milestone by @ko3n1g :: PR: #1951
Remove M-FSDP testing under LTS environment by @shjwudp :: PR: #1959
ci: Run on push to release branch by @ko3n1g :: PR: #1960
Fix typo in rl section of CODEOWNERS by @tdene :: PR: #1968
ci: Update copyright checker by @ko3n1g :: PR: #1973
Ko3n1g/ci/auto reminder GitHub by @ko3n1g :: PR: #1955
ci(fix): Run tests label by @ko3n1g :: PR: #1970
Make get_asyncio_loop safe to use repeatedly by @tdene :: PR: #1990
chore: Update codeowners by @ko3n1g :: PR: #2012
zarr soft deprecation by @dimapihtar :: PR: #2004
Deduplicate dynamic engine + coordinator. by @lmcafee-nvidia :: PR: #1981
Update symmetric registration interface to sync-up with upstream pytorch change by @youngeunkwon0405 :: PR: #1924
Safely access state dict args in load ckpt by @maanug-nv :: PR: #1957
Allow mixed-batch sampling in dynamic inference by @tdene :: PR: #1927
Stop Nemo_CICD_Test from failing in forks by @tdene :: PR: #2024
Clean up dynamic inference step by @tdene :: PR: #1992
ci: Auto-update copy-pr-bot vetters by @ko3n1g :: PR: #1850
ci: Fix build-push-wheel workflow by @ko3n1g :: PR: #2022
ci: Enable integration tests by @ko3n1g :: PR: #2023
chore: Update tooling for interactive jobs by @ko3n1g :: PR: #2032
Have datasets account for tokenizers which incorrectly define PAD by @tdene :: PR: #2017
revert(hotfix): ci: trustees_override by @ko3n1g :: PR: #2041
add missing warnings import in model parallel config by @yashaswikarnati :: PR: #2039
Reduce-scatter implementation with FP32 accumulation by @deepakn94 :: PR: #1967
ci(fix): Workflows on main by @ko3n1g :: PR: #2045
build: Bump modelopt by @ko3n1g :: PR: #2046
Remove TestCaptureFreezeGC unit test. by @lmcafee-nvidia :: PR: #1978
ci: Add multi-approval action by @ko3n1g :: PR: #2051
Ko3n1g/ci/test iteration time by @ko3n1g :: PR: #2067
Allow inference test throughput to vary by 10% by @mathemakitten :: PR: #2070
chore: Fix autoformatter by @ko3n1g :: PR: #2073
ci(hotfix): Bypass approvalbot in merge-queue by @ko3n1g :: PR: #2082
chore: Update local tooling by @ko3n1g :: PR: #2066
Add extra RL files by @tdene :: PR: #2077
Prevent summary jobs from running in forks by @tdene :: PR: #2083
ci: Fix test scope by @ko3n1g :: PR: #2091
Refactor the attention metadata into separate classes by @kanz-nv :: PR: #2001
Guard against incorrectly using MoE prefill graphs by @tdene :: PR: #2030
Run mr-slim tests in lightweight-mode by @chtruong814 :: PR: #2106
Inference | Lazy compile UVM allocator. by @lmcafee-nvidia :: PR: #1977
chore: Reenable trustees by @ko3n1g :: PR: #2108
Ko3n1g/chore/update release settings by @ko3n1g :: PR: #2097
ci(fix): Changeset of copyright checker by @ko3n1g :: PR: #2110
Remove unnecessary check on rotary_pos_cos by @santhnm2 :: PR: #2003
(Reverted) Inference | Lazy compile UVM allocator. by @lmcafee-nvidia :: PR: #2125
Refactor Attention Metadata to Separate Classes by @kanz-nv :: PR: #2112
Refactor model_provider to model_builder format for ModelOpt examples by @AAnoosheh :: PR: #2107
wandb Inference stats logging by @wdykas :: PR: #2026
Make PipelineParallelLayout always return str from __repr__ by @ananthsub :: PR: #2055
Add flash_attn_3 as first option for FA3 import by @santhnm2 :: PR: #2010
Add debugging hint for case when cudagraphs are created but no matching runner is found by @mathemakitten :: PR: #2129
ci: LTS container by @ko3n1g :: PR: #2133
Fix param init by @cuichenx :: PR: #2033
Hotfix to unit tests on hopper FA3 by @tdene :: PR: #2143
Add BytesIO to safe_globals by @tdene :: PR: #2074
add deprecation warning for legacy tokenizer system by @dimapihtar :: PR: #2145
replay: ci: Bump LTS container by @ko3n1g :: PR: #2157
Hotfix to unit tests on hopper FA3 (bis) by @tdene :: PR: #2179
Fix has_modelopt_state() for native Torch checkpoint format by @AAnoosheh :: PR: #2160
chore: Remove codeowners by @ko3n1g :: PR: #2175
Fix FP8 inference with sequence parallelism by @santhnm2 :: PR: #2009
Replace ModelOpt generation server by @AAnoosheh :: PR: #2147
Add hybrid model support for dynamic inference engine by @santhnm2 :: PR: #1907
Async task and event loop safety in Megatron Core by @tdene :: PR: #2025
Rename skip_prompt_log_probs by @tdene :: PR: #2181
Dynamic inference context | UVM only. by @lmcafee-nvidia :: PR: #1983
ci: Run auto-update-copy-pr-bot only on forks by @ko3n1g :: PR: #2191
Inference throughput tests: refactor goldens to be in list format by @mathemakitten :: PR: #2072
Enable TE custom quantization recipe by @negvet :: PR: #2005
Add MoE parameters to ModelOpt pruning example + conf fixes by @kevalmorabia97 :: PR: #2205
Add repr to pg collection class by @yashaswikarnati :: PR: #2089
Move data_samplers.py from legacy to training.datasets & add DistributedSignalHandler to DataLoader workers by @asolergi-nv :: PR: #2068
Fix Megatron-FSDP checkpoint save failure by @shjwudp :: PR: #2138
Fix moe CODEOWNERS. by @jaredcasper :: PR: #2200
chore: Update LICENSE by @ko3n1g :: PR: #2219
remove megatron.training dependency from megatron.core for FSDP checkpoint with EP by @ananthsub :: PR: #2113
Tensorize dynamic inference mixed sampling by @tdene :: PR: #2105
Add unit test for inference DP coordinator by @tdene :: PR: #2187
Inference linear layer by @sidsingh-nvidia :: PR: #1908
chore: Prefer Nvidia email addresses for reminder bot by @ko3n1g :: PR: #2221
[Megatron-FSDP] Fix hang caused by non-deterministic reduce-scatter by @shjwudp :: PR: #2218
Remove qwen symlink to fix for case-insensitive FS by @kevalmorabia97 :: PR: #2235
Optimizer refactor: clean up public get_megatron_optimizer interface and provide a more general API to support passing in different hyperparameters to subsets of parameters by @deepakn94 :: PR: #2047
Fix CI for PR#1983 by @lmcafee-nvidia :: PR: #2245
Fix aux-loss logging for hybrid models by @deepakn94 :: PR: #2197
Update flops calculation (for throughput) for hybrid MoEs by @deepakn94 :: PR: #2198
Enable kv cache in training for eagle by @yeyu-nvidia :: PR: #1895
Tensorize dynamic inference mixed sampling (bis) by @tdene :: PR: #2231
chore: Fix codeowners by @ko3n1g :: PR: #2264
Allow loading checkpoint from iteration 0 by @ananthsub :: PR: #2199
ci: Skip install test in merge queue by @chtruong814 :: PR: #2281
Add MoE layer type to hybrid models by @deepakn94 :: PR: #2259
Add the Hybrid-EP backend to the Flex Dispatcher by @Autumn1998 :: PR: #2176
[MAIN][NVFP4] Support NVFP4 MOE with Proper Padding by @zhongbozhu :: PR: #1985
Update ModelOpt example readmes and advanced usage by @kevalmorabia97 :: PR: #2273
Fix UVM compatibility with CUDA 13. by @lmcafee-nvidia :: PR: #2243
ci: Add flaky marker to LTS tests by @ko3n1g :: PR: #2290
Dynamic engine suspend/resume via prefill. by @lmcafee-nvidia :: PR: #1982
fix: Pass the timeout argument for the EP group by @yanring :: PR: #2268
JIT for MoE router and preprocess by @yaox12 :: PR: #1919
Hotfix to CI, until the fix gets reviewed by @tdene :: PR: #2298
Add functional test for DP coordinator throughput by @tdene :: PR: #2189
Add asyncio Queue like in Python 3.13 by @tdene :: PR: #2224
Fixes for PR#1982 by @lmcafee-nvidia :: PR: #2303
Fix PP KV cache allocation and enable multi-node PP inference by @santhnm2 :: PR: #2182
Revert active-buffer-size-gb arg name. by @lmcafee-nvidia :: PR: #2257
feat: check: api backwards compatibility by @pablo-garay :: PR: #2251
Add MambaInferenceStateConfig dataclass by @santhnm2 :: PR: #2265
Fix typo in inference example by @santhnm2 :: PR: #2311
feat: initialization of API backward compatibility verification by @pablo-garay :: PR: #2310
Fix Mamba TP and remove confusing legacy initialization by @jaredcasper :: PR: #2202
Refactor KD to use ModelOpt plugins file by @AAnoosheh :: PR: #2305
Fix dynamic context syntax and remove redundant tensors by @kanz-nv :: PR: #2336
Improve asyncio exception handling by @tdene :: PR: #2300
ci: Upload to testpypi only on main by @ko3n1g :: PR: #2342
implement graph config by @kanz-nv :: PR: #2203
feat: required check adjustment by @pablo-garay :: PR: #2350
fix: load iteration 0 for release checkpoints by @ananthsub :: PR: #2351
Explicitly zero out padding token activations for dynamic inference by @santhnm2 :: PR: #2008
Bugfix for Mamba with Chunked-Prefill by @sidsingh-nvidia :: PR: #2293
Break apart dynamic inference step into 2 methods by @tdene :: PR: #2192
Prevent unnecessarily overwriting the default Hugging Face chat te...
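
The entry "Reduce-scatter implementation with FP32 accumulation" refers to keeping the running sum of a reduction in higher precision than the inputs. The changelog does not state the input dtype or how the collective is implemented, so the sketch below only illustrates the numerical effect. It assumes FP16 inputs, simulates the rounding on the CPU with the standard `struct` module, and involves no communication. All function names are hypothetical.

```python
import struct
from typing import Iterable


def round_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def reduce_fp16_accumulator(contributions: Iterable[float]) -> float:
    """Sum FP16 inputs while rounding the running total to FP16 each step."""
    total = 0.0
    for value in contributions:
        total = round_fp16(total + round_fp16(value))
    return total


def reduce_fp32_accumulator(contributions: Iterable[float]) -> float:
    """Sum FP16 inputs in an FP32 running total, then cast the result to FP16."""
    total = 0.0
    for value in contributions:
        total = round_fp32(total + round_fp16(value))
    return round_fp16(total)


# 4096 equal contributions of 1.0 to a single gradient element.
contributions = [1.0] * 4096

low = reduce_fp16_accumulator(contributions)
high = reduce_fp32_accumulator(contributions)
print(low, high)
```

With the FP16 accumulator the total stops growing once an increment of 1.0 is smaller than the spacing between representable values, so `low` ends up below the true sum of 4096, while the FP32 accumulator reaches it exactly.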

This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information, visit https://www.nvidia.com/en-us/security/. For acknowledgement, please reach out to the NVIDIA PSIRT team at PSIRT@nvidia.com.

@marksverdhei

Features
- Performance
  - Fused QKV preprocessing with precomputed RoPE caches (3x preprocessing speedup, 10-14% E2E) (MR !3912)
  - Use new TE interface for user buffers (MR !3886)
  - Add CPU activation offloading via TE (MR !4286)
  - Add setting to support Adam or AdamW optimizer (MR !3866)
- MoE
  - Add DTensor support for EP and DSv3 modules (MR !3955)
  - Add HybridEP backend to Flex Dispatcher (PR #2176)
  - Implement NVFP4 Zero Padding for MoE (PR #1985)
  - Compute shared experts before router (MR !4068)
  - Enable bias in expert MLP (MR !3858)
- Model support
  - Add YaRN support for GPT-OSS (MR !4044)
  - Add FP8 init for MTP (MR !3958)
  - Add fp8_dpa option for FP8 scaling (MR !4053)
- FSDP
  - Enable joint training of parallel modules (MR !3850)
- Inference
  - Add CUDA Graph runner lookup table cache (up to 2x E2E speedup) (MR !4082)
  - Add MoE dropping and padding router for CUDA Graph + decode (MR !3816)
  - Integrate unified memory for dynamic inference context (MR !3985)
- Post-training
  - Add GPT-OSS ModelOpt support with quantization, import/export (MR !4169)
  - Enable KD support with hybrid training loop (MR !4021)
  - Add ModelOpt pruning example (MR !4022)
- RL
  - Add importance sampling and partial rollouts to Megatron RL (MR !4000)
  - Add sequence packing for RL (MR !4191)
- Ease of use
  - Handle CUDA absence during import (MR !4120)
  - Enable SWA mixing with attention (MR !3855)
Bug fixes
- Fix convergence bug in MXFP8 parameter gradient buffer reuse (MR !3999)
- Fix loss mask cloning to prevent incorrect updates (MR !4164)
- Fix metadata loss in checkpoints (MR !4182)
- Fix FSDP grad accum fusion support (MR !4018)
- Fix non-TE optimizer checkpoint issue (MR !3931)
- Fix BERT virtual pipeline parallelism (MR !3993)
- Fix gc.freeze() slowdown by adding gc.collect() on last layer (MR !4003)
- Fix full iteration CUDA graph non-tensor handling (MR !4019)
- Fix model_auto_sync mis-set and add gradient assertion (MR !4062)
- Fix HF import dtype and checkpoint loading issues (MR !4095)
- Fix missing initialization in ProcessGroupCollection (MR !4159)
- Fix sink attention TP (MR !4173)
- Fix 1f1b overlap unit tests for MTP standalone (MR !4210)
- Fix stale state dict handling (MR !4226)
Known issues
New Contributors
- @marksverdhei made their first contribution in #1980
- @Skylion007 made their first contribution in #2047
- @azzhipa made their first contribution in 5db6704
- @vicoooo26 made their first contribution in 5db6704
- @A-transformer made their first contribution in e002b5c
- @chaitanyadwivedii made their first contribution in 20b3954

We'd like to thank all our external contributors whose work was merged in this release:

External Contributor Acknowledgements
- Fix ImportError and NameError in examples/run_simple_mcore_train_loop.py by @marksverdhei in #1980
- Optimizer refactor: clean up public get_megatron_optimizer interface by @Skylion007 in #2047
- Typo fixes from community with co-authors @vicoooo26, @azzhipa, @A-transformer in 5db6704 and e002b5c
- Fix router input jitter dtype by @chaitanyadwivedii in 20b3954

Note: Some contributions came through internal MRs and are referenced by commit hash instead of PR number. We are now GitHub-first, so all PRs moving forward will be tested and merged in public.

Releases: NVIDIA/Megatron-LM
- NVIDIA Megatron Core 0.17.1
- 26.04-alpha.rc2
- 26.04-alpha.rc1
- NVIDIA Megatron Core 0.17.0
- NVIDIA Megatron Core 0.16.1
- NVIDIA Megatron Core 0.16.0
- NVIDIA Megatron Core 0.15.3
- NVIDIA Megatron Core 0.15.2
- NVIDIA Megatron Core 0.15.1
- NVIDIA Megatron Core 0.15.0