Releases: NVIDIA/Megatron-LM

NVIDIA Megatron Core 0.17.1

28 May 20:31

Changelog Details
  • beep boop 🤖: Bumping versions by @svcnvidia-nemo-ci :: PR: #4349
  • cp: NVFP4 native weights for DDP (4005) into core_r0.17.0 by @ko3n1g :: PR: #4290
  • docs: bump project.json and versions1.json to 0.17.0 by @ko3n1g :: PR: #4361
  • [docs] ci: fix version picker in 0.17.0 docs by @ko3n1g :: PR: #4363
  • [docs] ci: use parent-relative json_url for version picker by @ko3n1g :: PR: #4366
  • Backport NVRx async checkpoint compatibility to core_r0.17.0 by @sbak5 :: PR: #4453
  • cp: add permute fusion into hybrid ep (4089) into core_r0.17.0 by @ko3n1g :: PR: #4488
  • cp: get rid of weights_only=False (4434) into core_r0.17.0 by @ko3n1g :: PR: #4554
  • cp: SafeUnpickler class for safe pickle usage (4319) into core_r0.17.0 by @ko3n1g :: PR: #4555
  • cp: checkpoint integrity verification (4305) into core_r0.17.0 by @ko3n1g :: PR: #4556
  • fix(async_ckpt): import inspect in async_utils on core_r0.17.0 by @ko3n1g :: PR: #4597
  • chore(beep boop 🤖): Bump uv.lock (core_r0.17.0) (2026-05-04) by @svcnvidia-nemo-ci :: PR: #4598
  • cp: fix: Replace polynomial rolling hash with SHA-256 for prefix caching (#4158) by @chtruong814 :: PR: #4612
  • build: relax transformers cap to <=5.3.0 on core_r0.17.0 by @ko3n1g :: PR: #4701
  • chore: Bump TE to latest 2.14 by @chtruong814 :: PR: #4772
  • cp: additional tests for nvrx (#4522) by @chtruong814 :: PR: #4826
  • Release 0.17.0 by @ko3n1g
  • Bump mfsdp to 0.4.0 by @ko3n1g
  • cp: NVFP4 native weights for DDP (4005) into core_r0.17.0 (#4290) by @ko3n1g
  • docs: bump project.json and versions1.json to 0.17.0 (#4361) by @ko3n1g
  • [docs] ci: fix version picker in 0.17.0 docs (#4363) by @ko3n1g
  • [docs] ci: use parent-relative json_url for version picker (#4366) by @ko3n1g
  • chore(beep boop 🤖): Bump (core_r0.17.0) (2026-04-20) by @github-actions[bot]
  • Backport NVRx async checkpoint compatibility to core_r0.17.0 (#4453) by @sbak5
  • add permute fusion into hybrid ep (#4089) by @Autumn1998
  • Merge pull request #4488 from NVIDIA/cherry-pick-4089-core_r0.17.0 by @ko3n1g
  • get rid of weights_only=False (#4434) by @dimapihtar
  • SafeUnpickler class for safe pickle usage (#4319) by @dimapihtar
  • checkpoint integrity verification (#4305) by @dimapihtar
  • Merge pull request #4554 from NVIDIA/cherry-pick-4434-core_r0.17.0 by @ko3n1g
  • Merge pull request #4555 from NVIDIA/cherry-pick-4319-core_r0.17.0 by @ko3n1g
  • Merge pull request #4556 from NVIDIA/cherry-pick-4305-core_r0.17.0 by @ko3n1g
  • fix(async_ckpt): import inspect in async_utils on core_r0.17.0 (#4597) by @ko3n1g
  • chore(beep boop 🤖): Bump uv.lock (core_r0.17.0) (2026-05-04) (#4598) by @svcnvidia-nemo-ci
  • cp: fix: Replace polynomial rolling hash with SHA-256 for prefix caching (#4158) (#4612) by @chtruong814
  • build: relax transformers cap to <=5.3.0 on core_r0.17.0 (#4701) by @ko3n1g
  • chore(beep boop 🤖): Bump (core_r0.17.0) (2026-05-11) by @github-actions[bot]
  • chore: Bump TE to latest 2.14 (#4772) by @chtruong814
  • cp: additional tests for nvrx (#4522) (#4826) by @chtruong814
  • chore(beep boop 🤖): Bump (core_r0.17.0) (2026-05-18) by @github-actions[bot]
  • chore(beep boop 🤖): Bump (core_r0.17.0) (2026-05-25) by @github-actions[bot]
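
Several entries above harden checkpoint loading (dropping weights_only=False, a SafeUnpickler class, integrity verification). The general idea behind a restricted unpickler can be sketched with the standard library alone. This is a minimal illustration, not the Megatron Core implementation: the class name AllowlistUnpickler, the ALLOWED_GLOBALS set, and the safe_loads helper are all hypothetical names chosen for this example.

```python
import io
import pickle

# Hypothetical allowlist: the only (module, name) globals a pickle may reference.
ALLOWED_GLOBALS = {
    ("collections", "OrderedDict"),
}


class AllowlistUnpickler(pickle.Unpickler):
    """Illustrative unpickler that rejects any global not on the allowlist."""

    def find_class(self, module, name):
        # find_class is called whenever the pickle stream references a global
        # (a class or function). Refusing unknown globals blocks the usual
        # arbitrary-code-execution route through pickle.
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is not allowed")


def safe_loads(data: bytes):
    """Deserialize bytes using the allowlist-based unpickler."""
    return AllowlistUnpickler(io.BytesIO(data)).load()
```

Plain containers such as dict, list, str, and int are encoded with dedicated pickle opcodes and never reach find_class, so they load without an allowlist entry. Anything that names a class or function must be listed explicitly.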

26.04-alpha.rc2

07 May 07:07
99a1081

[MXFP8 param gather]Post processing after synced param AG in eval (#4…

26.04-alpha.rc1

23 Apr 08:22
85bced0

[Dev] Add high-priority a2a comm stream option and hybridep preproces…

NVIDIA Megatron Core 0.17.0

16 Apr 19:59
9539a12

Changelog Details

NVIDIA Megatron Core 0.16.1

20 Mar 21:24

Changelog Details
  • cp: ci: Skip cleanup-taint-node jobs during deployments (3612) into core_r0.16.0 by @ko3n1g :: PR: #3613
  • beep boop 🤖: Bumping versions by @svcnvidia-nemo-ci :: PR: #3616
  • cp: docs: Fix version picker urls (3621) into core_r0.16.0 by @ko3n1g :: PR: #3622
  • cp: ci: Increase changelog generation max PRs fetched (3620) into core_r0.16.0 by @ko3n1g :: PR: #3623
  • Cherry-pick #3399 for Mamba Uneven PP fix by @kevalmorabia97 :: PR: #3544
  • cp: fix: async_utils: explicit GC in persistent checkpoint worker loop (3591) into core_r0.16.0 by @ko3n1g :: PR: #3628
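
The last entry above mentions explicit garbage collection in a persistent checkpoint worker loop. As a rough sketch of that pattern (not the actual async_utils code), a long-lived worker can call gc.collect() after each task so that reference cycles left by one task do not linger across iterations. The worker_loop function and its queue-based protocol are assumptions made for this illustration.

```python
import gc
import queue
import threading


def worker_loop(tasks: queue.Queue, results: list) -> None:
    """Hypothetical persistent worker: run callables until a None sentinel arrives."""
    while True:
        task = tasks.get()
        if task is None:  # sentinel: shut the worker down
            break
        results.append(task())
        # Drop the reference to the finished task, then collect explicitly so
        # cyclic garbage is reclaimed now instead of at some later automatic
        # collection.
        del task
        gc.collect()


if __name__ == "__main__":
    q: queue.Queue = queue.Queue()
    out: list = []
    t = threading.Thread(target=worker_loop, args=(q, out))
    t.start()
    q.put(lambda: "saved shard 0")
    q.put(None)
    t.join()
    print(out)
```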

NVIDIA Megatron Core 0.16.0

26 Feb 04:17
3bec9aa

Changelog Details

NVIDIA Megatron Core 0.15.3

06 Feb 16:30
309ffca

This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information, visit https://www.nvidia.com/en-us/security/. For acknowledgement, please reach out to the NVIDIA PSIRT team at PSIRT@nvidia.com.

NVIDIA Megatron Core 0.15.2

08 Jan 15:42
core_v0.15.2
45b404c

Megatron-Core v0.15.2

NVIDIA Megatron Core 0.15.1

07 Jan 18:23
core_v0.15.1
512da5d

Core v0.15.1

NVIDIA Megatron Core 0.15.0

17 Dec 23:08
core_v0.15.0
0d7e02b

  • Features
    • Performance
      • Fused QKV preprocessing with precomputed RoPE caches (3x preprocessing speedup, 10-14% E2E) (MR !3912)
      • Use new TE interface for user buffers (MR !3886)
      • Add CPU activation offloading via TE (MR !4286)
      • Add setting to support Adam or AdamW optimizer (MR !3866)
    • MoE
      • Add DTensor support for EP and DSv3 modules (MR !3955)
      • Add HybridEP backend to Flex Dispatcher (PR !2176)
      • Implement NVFP4 Zero Padding for MoE (PR !1985)
      • Compute shared experts before router (MR !4068)
      • Enable bias in expert MLP (MR !3858)
    • Model support
    • FSDP
      • Enable joint training of parallel modules (MR !3850)
    • Inference
      • Add CUDA Graph runner lookup table cache (up to 2x E2E speedup) (MR !4082)
      • Add MoE dropping and padding router for CUDA Graph + decode (MR !3816)
      • Integrate unified memory for dynamic inference context (MR !3985)
    • Post-training
      • Add GPT-OSS ModelOpt support with quantization, import/export (MR !4169)
      • Enable KD support with hybrid training loop (MR !4021)
      • Add ModelOpt pruning example (MR !4022)
    • RL
      • Add importance sampling and partial rollouts to Megatron RL (MR !4000)
      • Add sequence packing for RL (MR !4191)
    • Ease of use
      • Handle CUDA absence during import (MR !4120)
      • Enable SWA mixing with attention (MR !3855)
  • Bug fixes
    • Fix convergence bug in MXFP8 parameter gradient buffer reuse (MR !3999)
    • Fix loss mask cloning to prevent incorrect updates (MR !4164)
    • Fix metadata loss in checkpoints (MR !4182)
    • Fix FSDP grad accum fusion support (MR !4018)
    • Fix non-TE optimizer checkpoint issue (MR !3931)
    • Fix BERT virtual pipeline parallelism (MR !3993)
    • Fix gc.freeze() slowdown by adding gc.collect() on last layer (MR !4003)
    • Fix full iteration CUDA graph non-tensor handling (MR !4019)
    • Fix model_auto_sync mis-set and add gradient assertion (MR !4062)
    • Fix HF import dtype and checkpoint loading issues (MR !4095)
    • Fix missing initialization in ProcessGroupCollection (MR !4159)
    • Fix sink attention TP (MR !4173)
    • Fix 1f1b overlap unit tests for MTP standalone (MR !4210)
    • Fix stale state dict handling (MR !4226)
  • Known issues
  • New Contributors

We'd like to thank all our external contributors whose work was merged in this release:

Note: Some contributions came through internal MRs and use commit hashes instead of PR numbers. We are now GitHub-first, so all PRs moving forward will be tested and merged in public.