SGLang Roadmap — 2026 Q2
Contributions and feedback are welcome. Join Slack.
Focus
Feature compatibility & reliability: Full compatibility and production-level reliability across P/D disaggregation, all parallelisms, speculative decoding, hierarchical cache, and load balancing.
Usability: Easy installation on NV/AMD/TPU/CPU; simple large-scale deployment (k8s, OME).
Kernel optimization: For next-gen hardware (GB300/GB200, B300/B200, MI350/MI355, TPU).
Reinforcement learning: Framework integration and training-inference mismatch mitigation.
Multimodal: Enhance diffusion models for video, image, and 3D generation; Omni model support.
Basic Feature Refactors and Improvements
Scheduler refactor
PoC: @hnyls2002
Slack: #dev, #spec-decoding
Goals: Make forward-mode more general; Make the scheduler more stateless; Fully support mixed chunked prefill; Hide CPU overhead in the scheduler for all cases; Move more preparation into the CUDA graph. [Feature] Overlap Spec Support #11762
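For illustration, a minimal sketch of the overlap idea behind these goals: build the CPU-side metadata for step i+1 while the GPU is still executing step i. The helpers `prepare_next_batch` and `launch_forward` are hypothetical placeholders, not the actual scheduler API.

```python
# Minimal sketch (not the real scheduler code) of hiding CPU overhead:
# prepare batch i+1 on a worker thread while the GPU runs batch i.
# `prepare_next_batch` and `launch_forward` are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor
import torch

def scheduler_loop(prepare_next_batch, launch_forward):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(prepare_next_batch)  # CPU prep for step 0
        while (batch := pending.result()) is not None:
            # Start CPU prep for the next step *before* touching the GPU,
            # so it overlaps with the in-flight forward pass below.
            pending = pool.submit(prepare_next_batch)
            launch_forward(batch)      # enqueue GPU work, returns quickly
            torch.cuda.synchronize()   # only now block on the device
```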
KV Cache management
PoC: @ispobock @hzh0425 @xiezhq-hermann
Slack: #kv-cache-store, #hybrid-model
Goals: Make hierarchical cache and hybrid attention native features; Support flexible session control for agentic workloads
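A hypothetical sketch of what flexible session control could look like for agentic workloads: pin a session's KV prefix across turns, then release it explicitly. The `open_session`/`close_session` methods and the `session_id` argument are illustrative, not the finalized API.

```python
# Illustrative session-control sketch: keep an agent's KV prefix resident in
# the hierarchical cache across turns instead of re-prefilling every call.
# open_session / close_session / session_id are hypothetical names.
import sglang as sgl

engine = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

sid = engine.open_session()  # reserve a reusable KV prefix for this agent
for turn in ["Plan the task.", "Execute step 1."]:
    out = engine.generate(prompt=turn, session_id=sid)  # reuses cached prefix
    print(out)
engine.close_session(sid)  # free the pinned KV blocks
```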
Speculative decoding
PoC: @Qiaolin-Yu
Slack: #spec-decoding
Goals: General abstraction for more spec algorithms; General abstraction for spec graph preparation & init; Adaptive spec configurations for different requests and batch sizes.
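One possible shape for that abstraction (illustrative only, not the actual SGLang interface): each algorithm implements draft proposal and verification, while adaptive configuration hangs off a shared base class.

```python
# Illustrative sketch of a general speculative-decoding abstraction: each
# algorithm only implements propose/verify; adaptive config is a shared hook.
# This is not the actual SGLang interface.
from abc import ABC, abstractmethod
import torch

class SpecAlgorithm(ABC):
    @abstractmethod
    def propose(self, hidden: torch.Tensor, num_draft: int) -> torch.Tensor:
        """Return draft token ids of shape [batch, num_draft]."""

    @abstractmethod
    def verify(self, draft: torch.Tensor, target_logits: torch.Tensor) -> torch.Tensor:
        """Return a boolean acceptance mask of shape [batch, num_draft]."""

    def num_draft_tokens(self, batch_size: int) -> int:
        # Adaptive-config hook: small batches can afford longer draft chains.
        return 4 if batch_size <= 8 else 2
```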
PD disaggregation
PoC: @ShangmingCai
Slack: #pd-disaggregation
Goals: [Roadmap] Prefill-Decode Disaggregation Roadmap (2026 Q2) #21703
API Server
PoC: @alexnails
Slack: #sglang-grpc-rfc-22558 #rust-migration
Goals: [RFC] Native gRPC Server for SGLang in Rust #22558
Rust migration
PoC: @ishandhanani @rainj-me
Slack: #rust-migration
Goals: Gradually rewrite most components (scheduler, API server, prefix tree) in Rust
CUDA graph runner backend
PoC: @Oasis-Git
Slack: #piecewise-cuda-graph
Goals: Support flexible CUDA graph backends: (decode, prefill) × (full, breakable, torch-compile-based piecewise CUDA graph); Enable breakable CUDA graph for prefill by default. [RFC] Cuda Graph Runner Backend Refactor #23004
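For background, the full-capture case these backends generalize over, shown with plain PyTorch APIs (the model is a stand-in): one frozen graph is captured once and replayed, while a breakable/piecewise backend captures smaller segments so dynamic ops can run between them.

```python
# Plain-PyTorch full CUDA graph capture/replay, following the documented
# warmup-then-capture pattern. The Linear model is a stand-in.
import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()
static_in = torch.zeros(8, 1024, device="cuda")

# Warm up on a side stream (required before capture).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_out = model(static_in)

# Replay: copy new data into the captured input buffer, relaunch all kernels.
static_in.copy_(torch.randn(8, 1024, device="cuda"))
g.replay()
print(static_out.norm().item())  # results land in the captured output buffer
```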
Parallelism
Pipeline parallelism refactor for long-context prefill and high-throughput decoding
PoC: @ShangmingCai
Slack: #pipeline-parallel
Issue: [Roadmap] Pipeline parallelism roadmap #11857
Expert parallelism
Slack: #expert-parallel
Issue: [Roadmap] GB200/GB300 development for Q2 #19650, [Roadmap] MoE Refactor #8715
Elastic parallel PRs: [1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP #10423, [4/N] Elastic EP support deepep backend #11837
Data parallelism attention refactor
Issue: [Feature] Load Balance Refactor for DP-Attention #16080
Context parallelism
PoC: @Fridge003 @kpham-sgl @ShangmingCai @ch-wan
Slack: #context-parallel
Issue: [Roadmap] Context Parallelism (2026 Q2) #21788
Distributed Weight Data Parallelism
PoC: @yuhao318
Issue: [Feature] Distributed Weight Data Parallelism (DWDP) for Sparse MoE Models #22084
GB200/GB300 NVL72 optimizations
PoC: @Fridge003
Slack: #deepseek-large-scale-serving
Issue: [Roadmap] GB200/GB300 development for Q2 #19650
Multimodal
Diffusion and Multimodal Generation
Slack: #diffusion
Issue: [Roadmap] SGLang-Diffusion (26 Q2) #23035
MLLM, VLM, and Multimodal Perception
Slack: #multi-modal
Issue: [Roadmap] Multimodal LLM (26 Q2) #23036
SGLang Omni
PoC: @zhaochenyang20 @FrankLeeeee
Repo: https://github.com/sgl-project/sglang-omni
Currently supported: Fish Audio S2 Pro (Dual AR), Qwen3-Omni (Thinker-Talker).
Q2 goals:
Refactor: collapse Stage→Worker→Executor→Engine to Stage→Engine; ~33K → ~10K lines; request-path depth 8–10 → 6 with no accuracy/perf regression. RFC Using mlx as backend #188
Day-zero serving for new Omni/audio-gen models: 3+ models at production quality; integration cost ~2 weeks → <1 week post-refactor.
Benchmark CI: extend the task × model framework (PR add cert functionality #223) with audio quality metrics, Video MMU, and calibrated regression thresholds.
Production observability: per-stage latency breakdown, token-level tracing, audio quality monitoring.
Performance: generalize S2 Pro's CUDA Graph + torch.compile path (55.8 → 120 TPS) into a reusable abstraction; close the Qwen3 Omni Talker gap.
Omni RL: expose rollout interface to Miles (joint with RL workstream).
Hardware
General multi-hardware abstraction
PoC: @alexnails
Issue: Multi platform Plugin #21388
NVIDIA collaboration
Issue: Nvidia Collaboration Roadmap (2026 Q2) #22960
AMD extension & specification on top of the Q2 items above
PoC: @HaiShaw
Issue: AMD Development Roadmap (2026 Q2) #23494
TPU
TorchTPU-based solution
JAX-based solution: Development Roadmap (2026 Q2) sglang-jax#909
Slack: #dev-jax-tpu
NPU
PoC: @iforgetmyname @ZhengdQin
Issue: [Roadmap] Ascend NPU Development (2026 Q1) #13664
Intel CPU/XPU
Kernels
Integrate more communication-compute overlap kernels
Integrate more quantization kernels (nvfp4, mxfp8)
Reliability and Observability
Dumping tools for debugging CUDA illegal-memory-access errors
Better per-request tracing
Runtime memory pool checks, PD transfer checksums, weight checksums
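A minimal sketch of the weight-checksum idea, for catching silent corruption after weight sync or PD transfer; this helper is illustrative, not an existing SGLang utility.

```python
# Illustrative weight-checksum helper: an order-stable SHA-256 over all
# parameter/buffer bytes, compared across ranks or nodes to detect drift.
import hashlib
import torch

def weight_checksum(model: torch.nn.Module) -> str:
    h = hashlib.sha256()
    for name, t in sorted(model.state_dict().items()):
        h.update(name.encode())
        # float32 round-trip keeps the sketch simple and sidesteps
        # bfloat16 tensors, which NumPy cannot represent directly.
        h.update(t.detach().to(torch.float32).cpu().contiguous().numpy().tobytes())
    return h.hexdigest()

model = torch.nn.Linear(4, 4)
print(weight_checksum(model))  # compare this digest across ranks/nodes
```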
RL Framework Integration
Miles
PoC: @yueming-yuan @fzyzcjy
Repo: https://github.com/radixark/miles
Landed: Unified FP8 E2E (blog); R3 routing replay for MoE (paper); INT4 QAT closed loop (blog); speculative RL with online SFT draft; zero-copy CUDA IPC weight sync; TIS/MIS off-policy correction (sketched below); VLM multi-turn; MrlX multi-agent.
Q2 goals: Zero mismatch for MoE RL; SGLang↔Megatron parity for MoE (TP/EP/PP); Diffusion / Omni / dLLM RL via shared rollout interface; elastic rollout-vs-training scheduling.
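For context on the TIS correction above, a short sketch, assuming TIS here means truncated importance sampling for training-inference mismatch: reweight trainer-side gradients by the clipped training/rollout probability ratio.

```python
# Sketch of truncated importance sampling (TIS) reweighting, assuming that
# is what "TIS" denotes above: ratio = pi_train / pi_rollout, clipped to
# bound variance. Shapes and the clip value are illustrative.
import torch

def tis_weights(logp_train: torch.Tensor, logp_rollout: torch.Tensor,
                clip: float = 2.0) -> torch.Tensor:
    return torch.exp(logp_train - logp_rollout).clamp(max=clip)

w = tis_weights(torch.tensor([-1.0, -2.0]), torch.tensor([-1.1, -1.2]))
print(w)  # tensor([1.1052, 0.4493])
```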
slime, verl, AReaL
PoC: @zhaochenyang20
Maintain SGLang as a first-class rollout backend across the major external RL frameworks: slime is the upstream that Miles tracks and the reference for SGLang-native recipes; verl is the industry-adopted Volcano Engine framework; AReaL is the async RL framework from Ant / Tsinghua.
Goals: Converge on one stable SGLang rollout-engine API to cut per-framework drift on weight sync, sampling, and logprob semantics; upstream shared primitives (R3, FP8, deterministic inference, TIS/MIS) so all four frameworks benefit together.
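A hedged sketch of the kind of unified surface this goal describes; the names and signatures are hypothetical, not an agreed-upon SGLang API.

```python
# Hypothetical rollout-engine contract covering the three drift-prone areas
# named above: weight sync, sampling, and logprob semantics. Not a real API.
from typing import Mapping, Protocol, Sequence
import torch

class RolloutEngine(Protocol):
    def update_weights(self, named_tensors: Mapping[str, torch.Tensor]) -> None:
        """Sync trainer weights into the inference engine (ideally zero-copy)."""

    def generate(self, prompts: Sequence[str], sampling_params: dict) -> list[str]:
        """Sample rollouts with fixed, documented sampling semantics."""

    def logprobs(self, prompts: Sequence[str],
                 completions: Sequence[str]) -> list[list[float]]:
        """Per-token logprobs with semantics shared by all frameworks."""
```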
Multi-LoRA Serving
TODO
Model Coverage
PoC: @wisclmy0611 @JustinTong0323
Slack: #dev
CI / Release / Maintenance
Improve stability, increase coverage, and reduce flakiness.
PoC: @alisonshao @Kangyan-Zhou
Slack: #ci-cd-build-release
Issue: [CI Infrastructure] Roadmap: Regression-Based CI Checks #21157, [Feature] Improve Unit Test Coverage #20865, [Tracking] SGLang CI/CD Test Coverage Improvements - Q2 2026 Roadmap #20847
Mock model for correctness tests
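One way such a mock model could work (illustrative only): a tiny module with seeded, fixed weights, so CI can assert on exact shapes and stored golden logits without downloading checkpoints.

```python
# Illustrative mock model for correctness tests: deterministic weights via a
# fixed seed, so expected outputs can be checked into CI as golden values.
import torch

class MockModel(torch.nn.Module):
    def __init__(self, vocab_size: int = 128, hidden: int = 16):
        super().__init__()
        torch.manual_seed(0)  # fixed seed => reproducible weights
        self.embed = torch.nn.Embedding(vocab_size, hidden)
        self.head = torch.nn.Linear(hidden, vocab_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.head(self.embed(input_ids))

model = MockModel().eval()
with torch.no_grad():
    logits = model(torch.tensor([[1, 2, 3]]))
assert logits.shape == (1, 3, 128)  # CI can also diff against golden logits
```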
TODO: This is still WIP. More sections will be added.