[Roadmap]: Dynamo roadmap for 1.2 (May) - 1.4 (Aug) #9178

@harryskim

Roadmap

Hi Dynamo developers!

We wanted to share the Dynamo roadmap for the next three releases (1.2 - 1.4). The Dynamo ecosystem has grown considerably. We recommend reading through the high-level plan first to get an overview, then using this roadmap to look up specific components of interest :)

Timeline

| 📦 Release | 1.2.0 | 1.3.0 | 1.4.0 |
| --- | --- | --- | --- |
| 🗓️ Planned date | 5/27/26 | 7/15/26 | 8/12/26 |

Performance

  • Performance & Benchmarking
    • [1.2] Kimi K2.5 multi-feature coding benchmark: One agentic-coding workload, run end-to-end across Dynamo's stack
      • Performance: disaggregated serving, KV-aware routing, Eagle-3 speculative decoding
      • Resilience: sub-second failover with in-flight request migration
      • Elasticity: rapid scale-up via fast weight loading. Reproducible recipe and traffic dataset published in AIPerf.
    • [1.2] DeepSeek-V4 family coverage:
      • Functional Hopper & Blackwell recipes across vLLM & SGLang.
    • [1.3] DeepSeek-V4 optimization:
      • Performant, multi-feature recipes for DeepSeek V4 models on both Hopper & Blackwell, focused on agentic coding use cases and leveraging KV Cache optimizations and disaggregated serving.
    • [1.3] Use-case optimized recipes for Hopper + Blackwell across TRT-LLM / vLLM / SGLang
      • Top 3 large LLMs - Long-context agentic + production-SLA shapes
      • Top 3 small LLMs - High-throughput short-prompt + multi-turn chat shapes.
    • [1.4] Use-case optimized recipes for Hopper + Blackwell across TRT-LLM / vLLM / SGLang
      • Expand recipes to Omni models and VLM models, focused on VLM multi-turn chat and streaming audio.
    • [1.3] Conditional Prefill/Decode Disaggregation
      • Standard disaggregated serving sends every request along a fixed prefill->decode pipeline, even when the cache state would make it cheaper to perform prefill+decode on the same worker.
      • Dynamo's conditional P/D evaluates each request individually, using prefill workers only when it is more efficient.
      • Routing is policy-driven rather than role-fixed, and any worker can serve as prefill or decode on demand, letting the system rebalance as traffic shape changes without paying a model-restart penalty.
  • AIConfigurator
    • [1.2] Agentic AIC Coverage Improvement
      • Shift from manual to agent-powered data collection for more rapid expansion of model/hardware/framework coverage, including broader Wide Expert Parallelism (WideEP) support to keep pace with frontier MoE deployments.
    • [1.3] AIConfigurator Dynamo Mocker integration
      • Expand simulation capabilities beyond disaggregated serving. AIConfigurator will simulate Dynamo performance on dynamic traffic, model Dynamo's KV-cache optimizations, and cover pipeline parallelism (PP). Forward-Pass Metrics from engines will also accelerate new model support by enabling AIC performance modeling without requiring custom op collectors for every new model path.
    • [1.4] DGDR Dynamo configuration recommendations
      • Using Forward-Pass Metrics streamed from live engines, AIConfigurator can continuously fine-tune its simulation in place, enabling Planner to make accurate live-scaling decisions against an AIC-backed performance model for specific customer traffic.
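
The conditional prefill/decode disaggregation item above can be pictured as a per-request cost comparison. The sketch below is illustrative only: the cost model, the `Request` fields, and every constant (`PREFILL_TOKENS_PER_S`, `KV_TRANSFER_TOKENS_PER_S`, `LOCAL_INTERFERENCE`) are assumptions made for this example, not Dynamo's actual policy or API.

```python
from dataclasses import dataclass

# Hypothetical numbers, chosen only to make the trade-off visible.
PREFILL_TOKENS_PER_S = 20_000.0       # assumed prefill throughput of one worker
KV_TRANSFER_TOKENS_PER_S = 200_000.0  # assumed KV transfer rate, in tokens/s
LOCAL_INTERFERENCE = 3.0              # assumed slowdown when prefill shares a worker with decode


@dataclass
class Request:
    prompt_tokens: int     # total prompt length
    cached_on_decode: int  # prefix tokens already cached on the decode worker


def local_prefill_cost(req: Request) -> float:
    """Seconds to prefill only the uncached suffix on the decode worker."""
    uncached = req.prompt_tokens - req.cached_on_decode
    return LOCAL_INTERFERENCE * uncached / PREFILL_TOKENS_PER_S


def remote_prefill_cost(req: Request, prefill_queue_s: float) -> float:
    """Seconds to queue, prefill the full prompt remotely, and ship the KV cache."""
    compute = req.prompt_tokens / PREFILL_TOKENS_PER_S
    transfer = req.prompt_tokens / KV_TRANSFER_TOKENS_PER_S
    return prefill_queue_s + compute + transfer


def choose_path(req: Request, prefill_queue_s: float) -> str:
    """Return "local" or "remote", whichever has the lower estimated cost."""
    if local_prefill_cost(req) <= remote_prefill_cost(req, prefill_queue_s):
        return "local"
    return "remote"


warm = Request(prompt_tokens=8000, cached_on_decode=7500)
cold = Request(prompt_tokens=8000, cached_on_decode=0)
print(choose_path(warm, prefill_queue_s=0.0))   # mostly cached prompt
print(choose_path(cold, prefill_queue_s=0.05))  # nothing cached
```

Under these assumed numbers the mostly cached request stays on the decode worker, while the cold request goes to a prefill worker. A real policy would presumably draw its inputs from live metrics rather than fixed constants.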

Scaling

  • Planner: Intelligent autoscaler to adhere to SLAs

    • [1.2 - 1.4] Planner Deployability
      • ForwardPassMetrics across all backends (vLLM, SGLang, TRT-LLM) flowing into Planner decisions
      • Planner plugins: allow for custom scaling logic via plugins (cost-aware scaling, spot instance draining, etc.)
      • Bring your own trace for Planner to understand your workload & learn from production traffic replay
    • [1.3] Global Planner for optimizing heterogeneous worker pools
      • GlobalPlanner topology for coordinating multiple local planners across disaggregated pools
    • [1.3 - 1.4] Agentic Planner
      • Dynamo-native workflow profiler that learns structured agent traces, produces a workflow profile artifact, and uses it at runtime for routing, priority, KV, prefetch, and admission hints
      • Scheduling of agents, sub-agents, and tool calls with priority-aware placement
      • Placement of agents to appropriate GPU SKU or heterogeneous HW
    • [1.3 - 1.4] Planner for RL
      • Multi-pool routing for extremely varying sequence lengths in RL rollouts (short-context + long-context pools with intelligent selection)
      • Trainer <> Inference <> Environment rate matching to prevent idle hardware
      • NATS scaling contract stabilization for non-K8s environments (veRL, SLIME integration)
    • [1.4] Self-Tuning Disaggregation
      • Planner-tuned Remote Prefill Policy: adjusts local-vs-remote prefill decisions live based on prefill-worker load
    • [1.3 - 1.4] Multi-Cluster
      • Lightweight global coordinator for SLO-based request spillover, policy-version fencing, and weight propagation (Global Planner decides where to send weights via ModelExpress)
  • Grove: Kubernetes-native AI inference orchestration

    • [1.3] Topology Aware Serving
      • Full topology-aware serving support across multi-node, multi-GPU configurations
      • Multi-topology support for serving multiple models with different topology/HW requirements on the same cluster
      • Preferred topology API to let workloads express topology preferences ("I want Prefill and Decode on NVSwitch-connected hardware")
    • [1.2 - 1.4] Scheduler Backend Framework
      • Compatibility with default kube-scheduler, KAI, Volcano, Koordinator via plugins
    • [1.2] Rolling Upgrades & Autoscaling
      • Support for rolling upgrade strategies that don't trigger full pod-gang rolls
      • Direct Grove + Planner integration for autoscaling
    • [1.3 - 1.5] Job Support
      • Supporting terminal states & gang termination
      • Bin-pack GPUs with jobs + inference deployments to support cross-workload scheduling
    • [1.3] Agents
      • Sub-agent colocation for best performance on agentic workloads
      • Sandbox scheduling with topology-awareness
    • [1.5] Heterogeneous HW Support
      • Support for GPU<>LPU scheduling in Kubernetes
      • Scheduling of jobs and deployments across multiple clusters and data-centers, regardless of scheduler/Kubernetes environments
  • Fault Tolerance: Built-in inference recovery and resiliency to keep Dynamo deployments stable and recoverable in production.

    • [1.2 - 1.3] GPU Memory Service
      • Productionizing inter-pod GPU memory service, allowing failed inference workers to recover in <5s via shadow failover switchover with GMS + warm compile cache
    • [1.3] Dynamo Snapshot: CRIU-based GPU process snapshots
      • Targeting performant checkpoint/restore of multi-GPU CRIU snapshots
      • Shadow checkpoints scale linearly with node count, not combinatorially with failure permutations
    • [1.3 - 1.4] WideEP
      • Fail-Continue mode: single GPU failure -> healthy ranks continue serving; in-flight requests retry/migrate transparently
      • Zero-Downtime Recovery: replacement GPU rejoins at rank granularity via GMS zero-copy weight remap, to ensure healthy ranks never stop serving
    • [1.2 - 1.4] Request Lifecycle Hardening
      • Rejection layer refactoring to reject requests pre-tokenization
      • SLA-tiered request handling, allowing migration of in-flight requests to different pools or cancellation of requests that break p95 SLA
      • Token-level request migration for RL rollouts to ensure long-running requests survive failures
  • Model Express: Fast GPU-to-GPU weight transfer over RDMA for autoscaling, fault tolerance, RL weight sync, and model lifecycle management

    • [1.2] Production Stability for Inference
      • NIXL registration optimization: per-tensor -> allocation-level registration
      • Registry backend rewrite for stateless horizontal scaling (SQLite -> Redis + K8s CRD)
      • Unified loader cascade: single --load-format mx flag covering P2P RDMA -> GDS -> HuggingFace default fallback
      • MX as default model loading path in Dynamo recipes
    • [1.2 - 1.4] RL Weight Transfer
      • Broadcast support for high-fan-out weight refresh across inference replicas during RL training
      • Resharding: transfer versioned distributed tensors across different train/inference layouts (e.g., training in EP8DP2, rollout in TP2DP4) without CPU bottlenecks
      • Validate MX across major RL frameworks (veRL, SLIME, Prime-RL, NeMo RL)
    • [1.2 - 1.4] Ecosystem Integrations
      • Allowing MX to be used with standalone vLLM, SGLang, TRT-LLM via engine loader
      • ModelStreamer integration: ModelStreamer handles S3->GPU, MX handles HBM<->HBM
      • MatrixHub integration: model weights governed/restricted in MatrixHub, P2P transfers via MX
    • [1.2 - 1.4] Platform Foundations
      • torch.compile cache sharing across replicas for cold-start optimization
      • Cross-cluster weight propagation for multi-cluster/data-center deployments
      • S3/object store backend for CSP-native model storage
  • Reinforcement Learning: Turning Dynamo into the rollout engine for RL post-training

    • [1.2] TITO Contract (Token-In / Token-Out)
      • Standardized inference contract for RL rollouts: logprobs passthrough, tokenize/detokenize endpoints, session-aware + KV-aware routing
      • Targeting integrations with veRL, Prime-RL, NeMo RL, Miles, Relax
    • [1.2 - 1.4] RL-Aware Routing & Scheduling
      • KV-aware load balancing across rollout workers to prevent stragglers and improve rollout throughput
      • Multi-pool routing for variable-length RL rollouts: short-context and long-context pools with intelligent sequence-length-based selection and autoscaling
      • Supporting agentic rollouts via cache_hints + G2 offloading
    • [1.2 - 1.3] Weight Transfer for RL
      • High-frequency weight sync from trainer to rollout workers via ModelExpress (NIXL), replacing framework-specific NCCL/UCX glue
      • Supporting multi-cluster/data-center weight updates on heterogeneous hardware
    • [1.2 - 1.3] LoRA-Aware Routing
      • Two-stage routing: LoRA adapter placement followed by KV/load scoring
      • Enables multi-tenant RL where different policy variants share the same base model with adapter-level isolation
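
The two-stage LoRA-aware routing described above (adapter placement first, then KV/load scoring) could look roughly like the sketch below. The `Worker` type, the linear scoring formula, and the fallback to all workers when no worker holds the adapter are assumptions for illustration, not Dynamo's router implementation.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass
class Worker:
    name: str
    adapters: FrozenSet[str]  # LoRA adapters currently resident on this worker
    active_requests: int      # simple stand-in for load


def route(
    adapter: str,
    overlap_blocks: Dict[str, int],
    workers: List[Worker],
    load_weight: float = 1.0,
) -> Worker:
    """Pick a worker for a request that needs `adapter`.

    `overlap_blocks` maps worker name to the number of KV blocks of this
    request's prefix that the worker already caches (hypothetical input).
    """
    # Stage 1: keep only workers that already hold the adapter.
    # Assumption: if none do, any worker may load it, so consider all of them.
    candidates = [w for w in workers if adapter in w.adapters] or list(workers)

    # Stage 2: among the candidates, prefer high KV overlap and low load.
    def score(w: Worker) -> float:
        return overlap_blocks.get(w.name, 0) - load_weight * w.active_requests

    return max(candidates, key=score)


workers = [
    Worker("a", frozenset({"policy-x"}), active_requests=4),
    Worker("b", frozenset({"policy-x"}), active_requests=1),
    Worker("c", frozenset({"policy-y"}), active_requests=0),
]
overlap = {"a": 10, "b": 2, "c": 50}
print(route("policy-x", overlap, workers).name)
```

Worker `c` has the best KV overlap but does not hold `policy-x`, so stage 1 removes it before scoring. This is the adapter-level isolation the bullet refers to.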

Non-LLM

  • Multimodality

    • [1.2] Embedding cache + multimodal routing benchmarks for Qwen3.5VL and Qwen3VL
      • Multimodal inputs (images, video frames) require a dedicated embedding step before the LLM prefill/decode stages. Caching these embeddings avoids redundant visual tokenization when the same image or video frame recurs across requests or conversation turns. Similarly, KV routing sends each request to the worker with the highest KV cache overlap, so prefill is skipped for the part of the prompt already cached there.
    • [1.2] E/P/D disaggregated serving with Intel B60 + NVIDIA H200 example
      • Extends Dynamo's disaggregated serving model to multimodal workloads by splitting the Embedding stage (visual tokenization), Prefill, and Decode stages across separate workers.
      • A reference example using heterogeneous hardware — Intel B60 for the embedding stage, NVIDIA H200 for prefill and decode — demonstrates cross-vendor disaggregation and enables operators to offload GPU-heavy embedding work to lower-cost accelerators.
    • [1.2] Multimodal Routing
      • Dynamo's Rust preprocessor uses a lightweight token expansion of multimodal prompts to identify KV-cache-warm workers without fully pre-processing the images. This gives a more exact view of prefix-cache matches on each worker for text+image inputs while adding little overhead to routing decisions.
    • [1.4] Transparent multimodal model pipelining
      • Dynamo orchestrates arbitrary multi-model pipelines (e.g., STT → LLM → TTS) as a single logical request; Dynamo handles scheduling/colocation, streaming, and latency budgeting across stages. This enables use cases like real-time voice assistants and audio-to-audio translation that today require custom orchestration outside of the inference stack.
  • Diffusion

    • [1.3] Streaming output support
      • Basic pipeline (text-encoded messages): Diffusion outputs are delivered incrementally as base64-encoded frames over Dynamo's standard SSE/HTTP streaming path, making them compatible with existing LLM clients without protocol changes.
      • Video diffusion native pipeline (CMAF Binary Streaming): High-bandwidth video frames are streamed as raw binary using CMAF Binary Streaming transport, eliminating the ~33% base64 overhead and reducing end-to-end latency for video generation workloads. Transport selection is configurable by the operator.
    • [1.3] Streaming input support
      • All Dynamo-side plumbing — stream ingestion, partial-prompt routing, and worker handoff — is built and validated for real-time prompt update scenarios (e.g., a user updating a diffusion prompt while a video is still being generated).
      • Real-world use case integration is not targeted in 1.3; mock workers validate the pipeline end-to-end.
    • [1.3] Diffusion pipeline: request plane improvements
      • Reduce data transfer overhead in the diffusion request path between pipeline stages.
  • Generative Recommendation

    • [1.2] Frontend changes to support Triton clients
      • Adds Dynamo frontend support for the Triton inference protocol, allowing existing Triton-based recommendation model clients to route traffic through Dynamo without modification.
      • Enables genrec teams to adopt Dynamo incrementally: the orchestration and request-routing layer migrates to Dynamo while backend model runtimes continue using Triton, reducing migration risk.
    • [1.3] TRT backend support
      • Wraps a TensorRT engine as a custom Dynamo worker, enabling TRT-optimized recommendation models to participate in Dynamo's routing, scheduling, and observability stack without a full backend rewrite.
      • TRT is integrated as a worker rather than a first-class Dynamo backend, keeping the scope bounded while still exposing TRT's performance advantages to genrec workloads.
    • [1.3] Migration of recsys example to Dynamo
      • Ports the HSTU model inference example from Triton Server to Dynamo, producing a validated reference architecture for genrec customers evaluating migration.
      • Validates the full end-to-end stack: Triton client → Dynamo frontend → TRT custom worker → recommendation model output, and serves as a concrete starting point for customer POCs.
  • Voice

    • [1.2] Input streaming support
      • Accepts streaming audio input (chunked PCM or encoded audio frames) directly into the Dynamo request pipeline, allowing the ASR/STT stage to begin processing as audio arrives rather than waiting for a fully buffered clip.
      • Reduces time-to-first-token for voice workloads and enables latency-sensitive applications like real-time transcription and live voice assistants.
    • [1.3] Bidirectional voice model support (Nemotron voice chat style)
      • Supports models that consume and produce audio streams in a single pass (full-duplex), enabling low-latency conversational voice without routing through separate STT + LLM + TTS pipeline stages.
      • End-to-end example with NVIDIA's Nemotron VoiceChat model in Dynamo, producing a reference deployment recipe for real-time conversational voice serving at production scale.
  • Triton Migration

    • [1.2 - 1.3] Triton worker running in Dynamo
      • Dynamo wraps a Triton Inference Server instance as a first-class worker, allowing any existing Triton model configuration (model repository, config.pbtxt, ensemble pipelines) to run inside Dynamo with no model-side changes.
      • Operators with existing Triton deployments can adopt Dynamo as the orchestration layer — gaining request routing, autoscaling, KV-aware scheduling, and observability — while the Triton model runtime and model artifacts remain untouched, enabling the lowest-friction entry point for Triton-to-Dynamo migration.
    • [1.4] Non-LLM inference in Dynamo
      • Extends Dynamo's worker model to natively support non-LLM batching strategies: Dynamic Batching (aggregate requests within a latency budget to maximize GPU utilization), Sequence Batching (stateful models requiring sequence-level affinity across requests), and Direct Workers (single-request passthrough for models that do not benefit from batching).
      • Enables Dynamo to serve audio, recommendation models, and related workloads under the same orchestration stack as LLMs, eliminating the need to maintain separate Triton deployments for non-LLM workloads.
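
Of the batching strategies listed above, Dynamic Batching is the easiest to show in a few lines: requests are collected until the batch is full or the latency budget of its first request runs out. The function below is a minimal offline simulation over arrival timestamps. Its name, its parameters, and the rule that a batch closes when the next request arrives are assumptions for illustration, not the planned Dynamo worker API.

```python
from typing import List


def dynamic_batches(arrivals: List[float], max_batch: int, budget_s: float) -> List[List[int]]:
    """Group request indices into batches.

    `arrivals` holds arrival times in seconds, sorted ascending. A batch is
    closed when it reaches `max_batch` requests, or when the next request
    arrives more than `budget_s` after the batch's first request.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    opened_at = 0.0
    for i, t in enumerate(arrivals):
        # The budget of the open batch has expired: close it first.
        if current and t - opened_at > budget_s:
            batches.append(current)
            current = []
        if not current:
            opened_at = t
        current.append(i)
        # The batch is full: close it right away.
        if len(current) == max_batch:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


arrivals = [0.000, 0.002, 0.004, 0.030, 0.031, 0.032, 0.033, 0.034]
print(dynamic_batches(arrivals, max_batch=4, budget_s=0.010))
```

A live server would close a batch on a timer rather than waiting for the next arrival. The simulation keeps only the grouping rule so the trade-off between batch size and added latency is easy to see.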

Core & Agents

  • Routing
    • [1.2] Hybrid Mamba model support (Qwen3.5, Nemotron)
      • Approximate mode router benchmarking across round-robin, KV approx, and KV-aware modes
      • Optimized performance for the Qwen3.5 and Nemotron families
    • [1.3] Multi-tenant isolation
    • Multi-cluster routing (see Planner section)
  • KVBM
    • [1.4] Performant CMX and remote storage KV offloading
    • [1.4] SGLang support
    • [1.5] P2P distributed KV cache
  • Agents
    • [1.2] Expand agentic hints support to improve agent performance.
    • [1.4] Profiling for static agentic workflows via agentic planner
    • [1.4] Pause agents waiting for tool calls or the next LLM turn when the system is under pressure, then restore them later.
    • [1.4] KV cache manipulation (offload to colder storage and prefetch when needed)
    • [1.5] CPU orchestration for agents
