Roadmap
Hi Dynamo developers!
We wanted to share the Dynamo roadmap for the next three releases (1.2 - 1.4). The Dynamo ecosystem has grown considerably. We recommend reading through the high-level plan first to get an overview, then using this roadmap to look up specific components of interest :)
Table of Contents
Timeline
| 📦 Release | 1.2.0 | 1.3.0 | 1.4.0 |
| --- | --- | --- | --- |
| 🗓️ Planned date | 5/27/26 | 7/15/26 | 8/12/26 |
Performance
- Performance & Benchmarking
- [1.2] Kimi K2.5 multi-feature coding benchmark: One agentic-coding workload, run end-to-end across Dynamo's stack
- Performance: disaggregated serving, KV-aware routing, Eagle-3 speculative decoding
- Resilience: sub-second failover with in-flight request migration
- Elasticity: rapid scale-up via fast weight loading. Reproducible recipe and traffic dataset published in AIPerf.
- [1.2] DeepSeek-V4 family coverage:
- Functional Hopper & Blackwell recipes across vLLM & SGLang.
- [1.3] DeepSeek-V4 optimization:
- Performant, multi-feature recipes for DeepSeek-V4 models on both Hopper & Blackwell, focused on agentic coding use cases and leveraging KV cache optimizations and disaggregated serving.
- [1.3] Use-case optimized recipes for Hopper + Blackwell across TRT-LLM / vLLM / SGLang
- Top 3 large LLMs - Long-context agentic + production-SLA shapes
- Top 3 small LLMs - High-throughput short-prompt + multi-turn chat shapes.
- [1.4] Use-case optimized recipes for Hopper + Blackwell across TRT-LLM / vLLM / SGLang
- Expand recipes to Omni models and VLM models, focused on VLM multi-turn chat and streaming audio.
- [1.3] Conditional Prefill/Decode Disaggregation
- Standard disaggregated serving sends every request along a fixed prefill->decode pipeline, even when the cache state would make it cheaper to perform prefill+decode on the same worker.
- Dynamo's conditional P/D evaluates each request individually, using prefill workers only when it is more efficient.
- Routing is policy-driven rather than role-fixed, and any worker can serve as prefill or decode on demand, letting the system rebalance as traffic shape changes without paying a model-restart penalty.
- AIConfigurator
- [1.2] Agentic AIC Coverage Improvement
- Shift from manual to agent-powered data collection for more rapid expansion of model/hardware/framework coverage, including broader Wide Expert Parallelism (WideEP) support to keep pace with frontier MoE deployments.
- [1.3] AIConfigurator Dynamo Mocker integration
- Expand simulation capabilities beyond disaggregated serving. AIConfigurator can simulate Dynamo performance on dynamic traffic, model Dynamo's KV-cache optimizations, and add pipeline parallelism (PP) simulation coverage. Forward-Pass Metrics from engines will also accelerate new model support by enabling AIC performance modeling without requiring custom op collectors for every new model path.
- [1.4] DGDR Dynamo configuration recommendations
- Using Forward-Pass Metrics streamed from live engines, AIConfigurator can continuously fine-tune its simulation in place, enabling Planner to make accurate live-scaling decisions against an AIC-backed performance model for specific customer traffic.
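To make the conditional prefill/decode idea from the [1.3] item above concrete, here is a minimal sketch of a per-request decision. It is illustrative only and is not Dynamo's implementation: the `Request` type, the `should_use_remote_prefill` function, and both thresholds are hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int         # total prompt length in tokens
    cached_prefix_tokens: int  # prompt tokens already in the decode worker's KV cache


def should_use_remote_prefill(
    req: Request,
    prefill_queue_depth: int,
    min_uncached_tokens: int = 512,  # hypothetical threshold, not a Dynamo default
    max_prefill_queue: int = 8,      # hypothetical threshold, not a Dynamo default
) -> bool:
    """Return True if the request should be sent to a dedicated prefill worker."""
    # Only the uncached part of the prompt actually needs prefill compute.
    uncached = max(req.prompt_tokens - req.cached_prefix_tokens, 0)

    # Small uncached prefill: running prefill+decode on one worker avoids
    # the KV transfer that a prefill->decode hop would require.
    if uncached < min_uncached_tokens:
        return False

    # Prefill pool is saturated: fall back to local prefill instead of queueing.
    if prefill_queue_depth >= max_prefill_queue:
        return False

    return True


if __name__ == "__main__":
    print(should_use_remote_prefill(Request(4096, 0), prefill_queue_depth=0))
    print(should_use_remote_prefill(Request(4096, 4000), prefill_queue_depth=0))
```

A real policy would likely weigh measured transfer cost and worker load rather than fixed thresholds. The sketch only shows that the decision is made per request, based on cache state, instead of being fixed by the pipeline.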
Scaling
- Planner: Intelligent autoscaler to adhere to SLAs
- [1.2 - 1.4] Planner Deployability
- ForwardPassMetrics across all backends (vLLM, SGLang, TRT-LLM) flowing into Planner decisions
- Planner plugins: allow for custom scaling logic via plugins (cost-aware scaling, spot instance draining, etc.)
- Bring your own trace for Planner to understand your workload & learn from production traffic replay
- [1.3] Global Planner for optimizing heterogeneous worker pools
- GlobalPlanner topology for coordinating multiple local planners across disaggregated pools
- [1.3-1.4] Agentic Planner
- Dynamo-native workflow profiler that learns structured agent traces, produces a workflow profile artifact, and uses it at runtime for routing, priority, KV, prefetch, and admission hints
- Scheduling of agents, sub-agents, and tool calls with priority-aware placement
- Placement of agents to appropriate GPU SKU or heterogeneous HW
- [1.3-1.4] Planner for RL
- Multi-pool routing for highly variable sequence lengths in RL rollouts (short-context + long-context pools with intelligent selection)
- Trainer <> Inference <> Environment rate matching to prevent idle hardware
- NATS scaling contract stabilization for non-K8s environments (veRL, SLIME integration)
- [1.4] Self-Tuning Disaggregation
- Planner-tuned Remote Prefill Policy: adjusts local-vs-remote prefill decisions live based on prefill-worker load
- [1.3-1.4] Multi-Cluster
- Lightweight global coordinator for SLO-based request spillover, policy-version fencing, and weight propagation (Global Planner decides where to send weights via ModelExpress)
- Grove: Kubernetes-native AI inference orchestration
- [1.3] Topology Aware Serving
- Full topology-aware serving support across multi-node, multi-GPU configurations
- Multi-topology support for serving multiple models with different topology/HW requirements on the same cluster
- Preferred topology API to let workloads express topology preferences ("I want Prefill and Decode on NVSwitch-connected hardware")
- [1.2 - 1.4] Scheduler Backend Framework
- Compatibility with default kube-scheduler, KAI, Volcano, Koordinator via plugins
- [1.2] Rolling Upgrades & Autoscaling
- Support for rolling upgrade strategies that don't trigger full pod-gang rolls
- Direct Grove + Planner integration for autoscaling
- [1.3 - 1.5] Job Support
- Supporting terminal states & gang termination
- Bin-pack GPUs with jobs + inference deployments to support cross-workload scheduling
- [1.3] Agents
- Sub-agent colocation for best performance on agentic workloads
- Sandbox scheduling with topology-awareness
- [1.5] Heterogeneous HW Support
- Support for GPU<>LPU scheduling in Kubernetes
- Scheduling of jobs and deployments across multiple clusters and data-centers, regardless of scheduler/Kubernetes environments
- Fault Tolerance: Built-in inference recovery and resiliency to keep Dynamo deployments stable and recoverable in production.
- [1.2 - 1.3] GPU Memory Service
- Productionizing the inter-pod GPU Memory Service (GMS), allowing failed inference workers to recover in <5s via shadow failover with GMS and a warm compile cache
- [1.3] Dynamo Snapshot: CRIU-based GPU process snapshots
- Targeting performant checkpoint/restore of multi-GPU CRIU snapshots
- Shadow checkpoints scale linearly with node count, not combinatorially with failure permutations
- [1.3 - 1.4] WideEP
- Fail-Continue mode: single GPU failure -> healthy ranks continue serving; in-flight requests retry/migrate transparently
- Zero-Downtime Recovery: replacement GPU rejoins at rank granularity via GMS zero-copy weight remap, to ensure healthy ranks never stop serving
- [1.2 - 1.4] Request Lifecycle Hardening
- Rejection layer refactoring to reject requests pre-tokenization
- SLA-tiered request handling, allowing migration of in-flight requests to different pools or cancellation of requests that break p95 SLA
- Token-level request migration for RL rollouts to ensure long-running requests survive failures
- Model Express: Fast GPU-to-GPU weight transfer over RDMA for autoscaling, fault tolerance, RL weight sync, and model lifecycle management
- [1.2] Production Stability for Inference
- NIXL registration optimization: per-tensor -> allocation-level registration
- Registry backend rewrite for stateless horizontal scaling (SQLite -> Redis + K8s CRD)
- Unified loader cascade: single --load-format mx flag covering P2P RDMA -> GDS -> HuggingFace default fallback
- MX as default model loading path in Dynamo recipes
- [1.2 - 1.4] RL Weight Transfer
- Broadcast support for high-fan-out weight refresh across inference replicas during RL training
- Resharding: transfer versioned distributed tensors across different train/inference layouts (e.g., training in EP8DP2, rollout in TP2DP4) without CPU bottlenecks
- Validate MX across major RL frameworks (veRL, SLIME, Prime-RL, NeMo RL)
- [1.2 - 1.4] Ecosystem Integrations
- Allowing MX to be used with standalone vLLM, SGLang, TRT-LLM via engine loader
- ModelStreamer integration: ModelStreamer handles S3->GPU, MX handles HBM<->HBM
- MatrixHub integration: model weights governed/restricted in MatrixHub, P2P transfers via MX
- [1.2 - 1.4] Platform Foundations
- torch.compile cache sharing across replicas for cold-start optimization
- Cross-cluster weight propagation for multi-cluster/data-center deployments
- S3/object store backend for CSP-native model storage
- Reinforcement Learning: Turning Dynamo into the rollout engine for RL post-training
- [1.2] TITO Contract (Token-In / Token-Out)
- Standardized inference contract for RL rollouts: logprobs passthrough, tokenize/detokenize endpoints, session-aware + KV-aware routing
- Targeting integrations with veRL, Prime-RL, NeMo RL, Miles, Relax
- [1.2 - 1.4] RL-Aware Routing & Scheduling
- KV-aware load balancing across rollout workers to prevent stragglers and improve rollout throughput
- Multi-pool routing for variable-length RL rollouts: short-context and long-context pools with intelligent sequence-length-based selection and autoscaling
- Supporting agentic rollouts via cache_hints + G2 offloading
- [1.2 - 1.3] Weight Transfer for RL
- High-frequency weight sync from trainer to rollout workers via ModelExpress (NIXL), replacing framework-specific NCCL/UCX glue
- Supporting multi-cluster/data-center weight updates on heterogeneous hardware
- [1.2 - 1.3] LoRA-Aware Routing
- Two-stage routing: LoRA adapter placement followed by KV/load scoring
- Enables multi-tenant RL where different policy variants share the same base model with adapter-level isolation
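The two-stage LoRA-aware routing described above can be sketched roughly as follows. This is an assumption-laden illustration, not Dynamo's router: the `Worker` fields, the `route` function, the fallback to all workers, and the scoring formula with `load_weight` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    loaded_adapters: set[str] = field(default_factory=set)
    kv_overlap: float = 0.0  # fraction of the prompt already cached, 0..1
    load: float = 0.0        # normalized load, 0..1


def route(workers: list[Worker], adapter: str, load_weight: float = 0.5) -> Worker:
    """Pick a worker in two stages: adapter placement, then KV/load scoring."""
    if not workers:
        raise ValueError("no workers available")

    # Stage 1: keep only workers that already hold the requested LoRA adapter.
    # If none do, fall back to every worker (the adapter would be loaded on demand).
    candidates = [w for w in workers if adapter in w.loaded_adapters] or workers

    # Stage 2: among the candidates, prefer high KV-cache overlap and low load.
    # The linear score here is a hypothetical stand-in for a real cost model.
    return max(candidates, key=lambda w: w.kv_overlap - load_weight * w.load)


if __name__ == "__main__":
    pool = [
        Worker("a", {"policy-v1"}, kv_overlap=0.2, load=0.1),
        Worker("b", {"policy-v2"}, kv_overlap=0.9, load=0.0),
        Worker("c", {"policy-v1"}, kv_overlap=0.8, load=0.5),
    ]
    print(route(pool, "policy-v1").name)
```

Filtering by adapter first keeps policy variants isolated at the adapter level. KV and load scoring then only breaks ties among workers that can already serve that variant.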
Non-LLM
Core & Agents
- Routing
- [1.2] Hybrid Mamba model support (Qwen3.5, Nemotron)
- Approximate mode router benchmarking across round-robin, KV approx, and KV-aware modes
- Optimized performance for the Qwen3.5 and Nemotron families
- [1.3] Multi-tenant isolation
- Multi-cluster routing (see Planner section)
- KVBM
- [1.4] Performant CMX and remote storage KV offloading
- [1.4] SGLang support
- [1.5] P2P distributed KV cache
- Agents
- [1.2] Expand agentic hints support to improve agent performance.
- [1.4] Profiling for static agentic workflows via agentic planner
- [1.4] Pause agents waiting for tool calls or the next LLM turn when the system is under pressure, then restore them later.
- [1.4] KV cache manipulation (offload to colder storage and prefetch when needed)
- [1.5] CPU orchestration for agents
Table of Contents
Timeline
Performance
Scaling
Planner: Intelligent autoscaler to adhere to SLAs
Grove: Kubernetes-native AI inference orchestration
Fault Tolerance: Built-in inference recovery and resiliency to keep Dynamo deployments stable and recoverable in production.
Model Express: Fast GPU-to-GPU weight transfer over RDMA for autoscaling, fault tolerance, RL weight sync, and model lifecycle management
Reinforcement Learning: Turning Dynamo into the rollout engine for RL post-training
Non-LLM
Multimodality
Diffusion
Generative Recommendation
Voice
Triton Migration
Core & Agents