[ROADMAP][2026 Q2] Megatron Core Roadmap #4997

@sbhavani

This roadmap outlines the key Megatron Core features, enhancements, and improvements planned for Q2 2026, with a focus on the June 2026 / 26.06 release. This is a tentative roadmap and subject to change.

For detailed information on past releases, see the Megatron Core release notes. For MoE-specific planning, see MoE Roadmap #4815.


Q2 Roadmap (Target: 26.06 / 0.17)

Parallelism

  • Megatron-FSDP2 - Refactored per-module sharding implementation with an FSDP2-compatible fully_shard() API, communication-compute overlap, activation recompute support, and pooled memory allocation for Megatron Core (#4435).
  • Megatron-FSDP + pipeline parallel compatibility - Continue broader Megatron-FSDP enhancements from the previous roadmap, including PP-compatible support and related sharding refinements (#2302).
  • A2A overlap for Megatron-FSDP - Bridge Megatron-FSDP's hook-based lifecycle with the EP overlap schedule's direct sub-module execution path (#3797).
  • * Hybrid / Dynamic CP - End-to-end Dynamic CP support, including PP and VPP coverage, building on Hybrid Data x Context Parallelism and Dynamic CP work (#2054, #2000).
  • Packed sequence support for GDN - Packed sequence support for gated delta net on main, following the earlier implementation work (#2645, #2644).
  • Permute fusion into Hybrid EP - Fuse permute / unpermute with dispatch / combine into a single kernel path for Hybrid EP (#4089).
  • Muon + Megatron-FSDP support - Preliminary Muon + Megatron-FSDP support (#4486).
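For readers unfamiliar with the packed sequence (THD) layout referenced in the packed-sequence item above and elsewhere in this roadmap, here is a minimal sketch of the bookkeeping only. It assumes the common convention in which variable-length sequences are concatenated along one "total tokens" axis and described by cumulative sequence lengths. The names pack_thd, unpack_thd, and cu_seqlens are illustrative and are not taken from the Megatron Core API.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def pack_thd(sequences: Sequence[Sequence[int]]) -> Tuple[List[int], List[int]]:
    """Concatenate variable-length sequences without padding.

    Returns the flat token list and the cumulative sequence lengths,
    where sequence i occupies tokens[cu_seqlens[i]:cu_seqlens[i + 1]].
    """
    tokens = [tok for seq in sequences for tok in seq]
    cu_seqlens = [0] + list(accumulate(len(seq) for seq in sequences))
    return tokens, cu_seqlens


def unpack_thd(tokens: Sequence[int], cu_seqlens: Sequence[int]) -> List[List[int]]:
    """Recover the original sequences from the packed representation."""
    return [
        list(tokens[start:end])
        for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:])
    ]


if __name__ == "__main__":
    packed, offsets = pack_thd([[1, 2, 3], [4], [5, 6]])
    print(packed)   # flat token axis, no padding
    print(offsets)  # sequence boundaries
```

Attention kernels that accept this layout use the boundaries to keep tokens from attending across sequences. The real implementation operates on tensors and must also handle CP and PP partitioning, which this sketch does not attempt.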

MoE

Items from MoE Roadmap #4815

  • sqrtsoftplus MoE score function - Add a new sqrtsoftplus score function to the router (#3673).
  • Distributed Layerwise Optimizer - Add a shard-aligned parameter layout for LayerWiseDistributedOptimizer that guarantees no parameter is split across shard boundaries (#4509).
  • CUDA Graph interface refactor - Decompose the overloaded cuda_graph_scope field into three dedicated, semantically distinct concepts and clean up naming throughout (#4292).
  • Newton-Schulz coefficients - Bump the emerging-optimizer package for the DeepSeek-V4 coefficients type (#4523).
  • THD format E2E support - Add end-to-end sequence packing (THD format) support to MCore (#3386).
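To illustrate what "no parameter is split across shard boundaries" can mean in practice, here is a minimal sketch of one possible padding-based layout. It is not the LayerWiseDistributedOptimizer implementation; the function name shard_aligned_offsets, the fixed shard_size, and the skip-to-next-boundary policy are all assumptions made for illustration.

```python
from typing import Dict, Tuple


def shard_aligned_offsets(
    param_sizes: Dict[str, int], shard_size: int
) -> Dict[str, Tuple[int, int]]:
    """Assign each parameter a [start, end) range in a flat buffer.

    A parameter that would straddle a multiple of shard_size is moved
    to the next boundary, leaving padding behind it, so that every
    parameter lives entirely inside one shard.
    """
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    offsets: Dict[str, Tuple[int, int]] = {}
    cursor = 0
    for name, size in param_sizes.items():
        if size > shard_size:
            raise ValueError(f"{name} ({size}) does not fit in one shard")
        room = shard_size - (cursor % shard_size)
        if size > room:
            cursor += room  # pad up to the next shard boundary
        offsets[name] = (cursor, cursor + size)
        cursor += size
    return offsets


if __name__ == "__main__":
    layout = shard_aligned_offsets({"a": 6, "b": 5, "c": 3, "d": 4}, shard_size=8)
    for name, (start, end) in layout.items():
        print(name, start, end, "shard", start // 8)
```

The trade-off in a scheme like this is wasted padding in exchange for whole-parameter ownership per shard, which matters for optimizers that need the full matrix of a layer rather than an arbitrary slice of it.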

Performance and Memory

  • Batch-processing utilities and PP for SFT - Consolidate get_batch utilities and enable pipeline parallel SFT in THD paths (#4103).
  • Emerging optimizer refactor - Refactor optimizer infrastructure to generalize beyond Muon and support additional emerging optimizers (#4113, #4119).
  • Chunked MLP training - Enable chunked MLP during training, extending prior inference-prefill support (#3656).
  • GroupGEMM + SwiGLU + quantize fused MLP - Transformer Engine fused grouped MLP support, grouped quantized tensor plumbing, checkpoint compatibility, op fuser work, and related MLP infrastructure (#4636).
  • Full-model CUDA Graph with paged stashing - Decouple oversized compute buffers from properly sized activation buffers for backward-pass storage, continuing the CUDA Graph / dropless MoE memory work (#2690, #4247).
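As background for the chunked MLP item above, the sketch below shows the general idea under the assumption that chunking is done along the token dimension: because an MLP is applied to each token independently, processing tokens in chunks gives the same output while bounding the size of the intermediate hidden activation. The helper names are hypothetical, ReLU stands in for the real activation, and plain Python lists stand in for tensors; the training-time work tracked in the roadmap also has to handle the backward pass, which is not modeled here.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain row-by-column matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def mlp(x: Matrix, w1: Matrix, w2: Matrix) -> Matrix:
    """Two-layer MLP applied to every token (row) of x."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


def chunked_mlp(x: Matrix, w1: Matrix, w2: Matrix, chunk_size: int) -> Matrix:
    """Same result as mlp(), but only chunk_size tokens are expanded
    to the hidden width at any one time."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    out: Matrix = []
    for start in range(0, len(x), chunk_size):
        out.extend(mlp(x[start:start + chunk_size], w1, w2))
    return out


if __name__ == "__main__":
    tokens = [[1.0, -2.0], [0.5, 0.5], [3.0, 1.0], [-1.0, -1.0], [2.0, 0.0]]
    w1 = [[1.0, -1.0, 0.5, 2.0], [0.0, 1.0, -0.5, 1.0]]
    w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
    print(chunked_mlp(tokens, w1, w2, chunk_size=2) == mlp(tokens, w1, w2))
```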

Precision

  • MXFP8 improvements and fixes - Fine-grained param gather configuration, forced param all-gather eval fixes, numerical fixes when DP overlap is disabled, and follow-through on MXFP8 FP8-param-gather support (#2582, #4181, #4562, #4800).
  • NVFP4 param gather - Make FP4 param gather work correctly with mixed precision in the NVFP4 recipe (#4358).
  • Low-precision guide - Add a practical guide for low-precision training recipes, trade-offs, and configuration.

Inference

  • New dispatcher for MoE inference - AllGatherV dispatcher for inference and simplification of the previous dispatcher path (#4258).
  • vLLM grouped GEMM backend - Port vLLM grouped GEMM to the inference-optimized MoE backend (#4566).
  • CUDA Graphs for MTP inference - Enable CUDA Graphs for MTP inference (#4260).

Model Architecture and Support

  • * DeepSeek-V4 training support - Support across CSA/HCA, hash MoE, clamped SwiGLU, mHC, MTP, packed sequence, long-context training, optimizer recipe work, FP4 QAT validation, and Bridge examples. Optimizations continue beyond 26.06 into 26.08 (#4468).
  • HybridModel - Next-generation heterogeneous-layer model class intended to replace GPTModel and MambaModel as the foundation for future Nemotron and open-source model families (#4620).
  • Flextron - Post-training method for converting a single parent LLM into a nested family of submodels at different parameter budgets (#4429).

Multimodal

  • MIMO core primitive - Add the core primitive for heterogeneous TP / DP MIMO training, including colocated bridge communication. This is a step toward the broader MIMO goal of early-fusion multimodal architectures with modular vision, audio, and video encoders, independent nD parallelism per module, and colocated or non-colocated training layouts (#1375, #4368).

* : dev-branch-first feature. These features ship to the dev branch first and are merged to main after validation.


Future Releases and Backlog

Parallelism

  • FSDP checkpoint conversion - Convert between fsdp_dtensor, torch_dist, and Hugging Face formats (#2805).
  • TP-compatible torch.compile - Broader torch.compile compatibility with tensor parallelism remains open, even though Megatron-FSDP + torch.compile work has landed separately (#2598, #2425).

Performance

  • Embeddings / output parameter sharding with LayerWiseOptimizer (#2163).
  • Long context support - Dual Chunk Attention and RoPE ABF variant (#2797).
  • Optimized attention variants - Optimized sparse attention (GDN, NSA, etc.) and related Transformer Engine implementations (TransformerEngine #2884, TransformerEngine #2511).
  • Multiple model training in a single process - Efficiency improvements for multi-model workflows.

Model Support

  • Multimodal + diffusion model consolidation - Unify multimodal and diffusion model stacks in Megatron Core by upstreaming and integrating capabilities from NVIDIA-NeMo/DFM (#1592).
  • MiMo-V2-Flash - Hybrid Attention + Fine-Grained MoE support (#2976).
  • Video generation models - Wan and DiT architecture support (#2796).
  • Discrete diffusion language models (#2728).
  • World models - Environment simulation and prediction model support.
  • LLaVA audio / sound support - Scalable audio encoder integration for LLaVA / MIMO.

Inference

  • Dynamic Inference Context for T5 (#3016).
  • Async CPU / GPU compute overlap during dynamic inference (#2019).

Ease of Use

  • Remove global variables from Megatron-LM - Cleaner runtime and configuration architecture (#2315).
  • Model Provider Interface from Megatron Bridge upstreamed to Megatron-LM (#2314).
  • Training loop modularization - Consistent training loop between Megatron-LM and Megatron Bridge.
  • HF tokenizer integration - Native Hugging Face tokenizer support.
  • On-the-fly tokenization for dataloaders (#2727).
  • Per-layer logging and memory estimation - Detailed profiling and capacity-planning support.
  • Improved experiment tracking integrations - Training workflow and logging integrations such as wandb.

Precision

  • FP8 param gather with CPU offloading (#2407).

Infrastructure and Ecosystem

  • Enhanced cross-datacenter training UX (#2795).
  • NCCL GIN support for AWS EFA (#2647).
  • Windows support (#2609).

v0.17 Highlights (Released April 2026)

Parallelism and Distributed Training

  • Hybrid model pipeline improvements - Flexible virtual pipeline parallelism and multi-module 1F1B pipelining for hybrid model layouts (#3377, #3129).
  • MIMO heterogeneous parallelism foundation - Multi-module heterogeneous parallelism, MIMO optimizer support, asymmetric DP communication fixes, distributed checkpointing for non-colocated MIMO, and CP + sequence packing support (#3211, #4019, #4020, #4021, #2135).
  • Megatron-FSDP improvements - HSDP with EP, dtype customization, all-gather in start param sync, mixed-precision policy plumbing, and MXFP8 transpose helper support (#2840, #3067, #3095, #3903, #3992, #4105).

Performance and Memory

  • muP support - Maximal Update Parameterization support and follow-on Muon scaling work (#3058, #3715).
  • MLA and fused kernel improvements - Absorbed MLA, fused MLA down-projection GEMMs, and fused dLN + add backward path (#3198, #3039, #3384).
  • CUDA Graph and optimizer improvements - CUDA Graph support for Adam optimizer and related graph-capture reliability improvements (#4142).
  • Checkpointing and async save hardening - Zero-copy storage sharing, async checkpoint fixes, single-process checkpoint save, NVRx integration, and checkpoint conversion fixes (#3649, #3591, #3633, #3899, #4058).

Inference

  • OpenAI-compatible inference API server - Move inference serving toward OpenAI-compatible API behavior (#3107).
  • KV prefix caching - Prefix caching for attention, hybrid models, Mamba memory, coordinator scheduling, and CUDA Graph support (#3063, #3225, #3657, #3665, #3922).
  • Inference-optimized MoE path - Inference-optimized MoEs and grouped GEMM BF16/MXFP8 support with CUDA Graph compatibility (#3496, #3858).
  • MTP and speculative decoding - Speculative decoding support with MTP layers and MTP inference fixes (#3594, #3297).
  • vLLM ecosystem support - vLLM fakequant export support (#3050).

Model Support and Training

  • mRoPE for MTP - Multi-modal RoPE support for Multi-Token Prediction (#3114).
  • GPT-OSS example with Megatron Bridge - End-to-end Megatron-LM + Megatron Bridge example (#3018).
  • Qwen3-VL with Megatron-FSDP - Qwen3-VL support with Megatron-FSDP (#2841).
  • GDN and Mamba work - Introduce Gated Delta Net support to Mamba (#3535).
  • ModelOpt examples - Additional Nemotron model optimization examples (#3805).

RL, Fine-Tuning, and Developer Experience

  • RL metrics and training flow - Off-policyness tracking, forced lag, RL sequence packing fixes, and Hybrid MoE training CUDA Graph support (#3030, #3515, #3517, #3551, #3373).
  • Python 3.12 migration - Move to Python 3.12 and announce Python 3.10 deprecation (#3826, #3825).
  • Typing and cleanup - Continued migration from ModuleSpec to Protocols, removal of legacy tokenizer / data / MPU paths, and general training-code cleanup (#3084, #3090, #3426, #2946, #3853, #3854).

How to Provide Feedback

We welcome community input on prioritization. Please:

  1. React to items you would like prioritized.
  2. Comment on this issue with use cases, constraints, and hardware / model configurations.
  3. Open focused feature requests with the enhancement label.
  4. Contribute pull requests for roadmap items where possible.

Credits

This roadmap reflects the collective efforts of NVIDIA, external contributors, and the Megatron Core community.
