Development Roadmap (2026 Q2) #22949

@merrymercy

SGLang Roadmap — 2026 Q2

Contributions and feedback are welcome. Join Slack.

Focus

  • Feature compatibility & reliability: Full compatibility and production-level reliability across P/D disaggregation, all parallelisms, speculative decoding, hierarchical cache, and load balancing.
  • Usability: Easy installation on NV/AMD/TPU/CPU; simple large-scale deployment (k8s, OME).
  • Kernel optimization: For next-gen hardware (GB300/GB200, B300/B200, MI350/MI355, TPU).
  • Reinforcement learning: Framework integration and training-inference mismatch mitigation.
  • Multimodal: Enhance diffusion models for video, image and 3D generation. Omni model support.

Basic Feature Refactors and Improvements

Parallelism

Multimodal

  • Diffusion and Multimodal Generation
    Slack: #diffusion
    Issue: [Roadmap] SGLang-Diffusion (26 Q2) #23035

  • MLLM, VLM, and Multimodal Perception
    Slack: #multi-modal
    Issue: [Roadmap] Multimodal LLM (26 Q2) #23036

  • SGLang Omni
    PoC: @zhaochenyang20 @FrankLeeeee
    Repo: https://github.com/sgl-project/sglang-omni
    Currently supported: Fish Audio S2 Pro (Dual AR), Qwen3-Omni (Thinker-Talker).
    Q2 goals:

    • Refactor per RFC #188 (Using mlx as backend): collapse the Stage→Worker→Executor→Engine hierarchy into Stage→Engine; shrink ~33K → ~10K lines; reduce request-path depth from 8–10 to 6, with no accuracy/perf regression.
    • Day-zero serving for new Omni/audio-gen models: 3+ models at production quality; integration cost ~2 weeks → <1 week post-refactor.
    • Benchmark CI: extend the task × model framework (PR add cert functionality #223) with audio-quality metrics, Video MMU, and calibrated regression thresholds.
    • Production observability: per-stage latency breakdown, token-level tracing, audio quality monitoring.
    • Performance: generalize S2 Pro's CUDA Graph + torch.compile path (55.8 → 120 TPS) into a reusable abstraction (a sketch follows this list); close the Qwen3 Omni Talker gap.
    • Omni RL: expose rollout interface to Miles (joint with RL workstream).
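
    A minimal sketch of what the reusable CUDA Graph + torch.compile decode path could look like, assuming a hypothetical TalkerStep module (the name, shapes, and dtype here are illustrative, not the S2 Pro implementation):

    ```python
    import torch
    import torch.nn as nn

    class TalkerStep(nn.Module):
        """Hypothetical stand-in for one autoregressive audio-decode step."""
        def __init__(self, d: int = 1024):
            super().__init__()
            self.proj = nn.Linear(d, d)

        def forward(self, hidden: torch.Tensor) -> torch.Tensor:
            return torch.tanh(self.proj(hidden))

    # mode="reduce-overhead" makes torch.compile capture the step as a CUDA
    # graph when shapes stay static, amortizing per-token launch overhead.
    step = torch.compile(TalkerStep().cuda().half(), mode="reduce-overhead")

    @torch.inference_mode()
    def decode(hidden: torch.Tensor, n_steps: int) -> torch.Tensor:
        for _ in range(n_steps):
            # Keep shapes static across iterations so the captured graph is replayed.
            hidden = step(hidden)
        return hidden
    ```

    The design point worth generalizing is that the per-step module, not the whole model, is the compile unit, so any dual-AR or Thinker-Talker stage with a static-shaped decode step can reuse the same wrapper.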

Hardware

Kernels

  • Experiment with MegaKernel integration
  • Move more kernels to JIT style (see the Triton sketch after this list)
  • Integrate more communication-compute overlap kernels
  • Integrate more quantization kernels (nvfp4, mxfp8)
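
As a reference point for the JIT direction, here is a minimal Triton sketch of a fused gated-MLP epilogue; the kernel and function names are illustrative, not taken from sgl-kernel:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def silu_mul_kernel(gate_ptr, up_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized tile of the flat tensors.
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    gate = tl.load(gate_ptr + offs, mask=mask)
    up = tl.load(up_ptr + offs, mask=mask)
    # SiLU(gate) * up, the common gated-MLP epilogue.
    tl.store(out_ptr + offs, gate * tl.sigmoid(gate) * up, mask=mask)

def silu_mul(gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(gate)
    n = gate.numel()
    grid = (triton.cdiv(n, 1024),)
    silu_mul_kernel[grid](gate, up, out, n, BLOCK=1024)
    return out
```

JIT-style kernels like this compile per shape/dtype at first use, which avoids shipping prebuilt binaries for every architecture listed under Hardware.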

Reliability and Observability

  • Dumping tools for debugging CUDA illegal-memory-access crashes
  • Better per-request tracing
  • Runtime memory pool checks, PD-transfer checksums, and weight checksums (see the sketch after this list)
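
A minimal sketch of the weight-checksum idea, assuming a SHA-256 digest over raw parameter bytes; `weight_checksum` is a hypothetical helper, not an existing SGLang API:

```python
import hashlib
import torch

def weight_checksum(model: torch.nn.Module) -> str:
    """Hypothetical helper: device-independent digest of all parameters,
    for verifying weights after a PD transfer or a weight sync."""
    h = hashlib.sha256()
    for name, p in sorted(model.named_parameters()):
        h.update(name.encode())
        # Reinterpret raw bytes so bf16/fp8 tensors hash without lossy casts.
        h.update(p.detach().flatten().cpu().contiguous()
                  .view(torch.uint8).numpy().tobytes())
    return h.hexdigest()
```

Comparing the digest on sender and receiver catches silent corruption that shape or dtype checks alone would miss.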

RL Framework Integration

  • Miles
    PoC: @yueming-yuan @fzyzcjy
    Repo: https://github.com/radixark/miles
    Landed: Unified FP8 E2E (blog); R3 routing replay for MoE (paper); INT4 QAT closed loop (blog); speculative RL with online SFT draft; zero-copy CUDA IPC weight sync; TIS/MIS off-policy correction (sketched below); VLM multi-turn; MrlX multi-agent.
    Q2 goals: Zero mismatch for MoE RL; SGLang↔Megatron parity for MoE (TP/EP/PP); Diffusion / Omni / dLLM RL via shared rollout interface; elastic rollout-vs-training scheduling.
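
    For context on the mismatch work, a minimal sketch of truncated importance sampling (TIS) for off-policy correction; the clamp value and function name are illustrative assumptions, not Miles's exact implementation:

    ```python
    import torch

    def tis_weights(train_logprobs: torch.Tensor,
                    rollout_logprobs: torch.Tensor,
                    c: float = 2.0) -> torch.Tensor:
        # Per-token importance ratio pi_train / pi_rollout, truncated at c
        # so drift between the rollout engine and the trainer cannot blow
        # up the gradient.
        ratio = torch.exp(train_logprobs - rollout_logprobs)
        return torch.clamp(ratio, max=c)

    # Usage: weight the per-token policy-gradient loss.
    # loss = -(tis_weights(lp_train, lp_rollout).detach()
    #          * advantages * lp_train).mean()
    ```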

  • slime, verl, AReaL
    PoC: @zhaochenyang20
    Maintain SGLang as a first-class rollout backend across the major external RL frameworks: slime is the upstream that Miles tracks and the reference for SGLang-native recipes; verl is the industry-adopted Volcano Engine framework; AReaL is the async RL framework from Ant / Tsinghua.
    Goals: converge on one stable SGLang rollout-engine API (sketched below) to cut per-framework drift in weight sync, sampling, and logprob semantics; upstream shared primitives (R3, FP8, deterministic inference, TIS/MIS) so all four frameworks benefit together.
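
    A hypothetical sketch of what such a converged rollout-engine API could look like; every name here (RolloutEngine, generate, update_weights, compute_logprobs) is illustrative, not SGLang's actual interface:

    ```python
    from typing import Protocol, Sequence

    import torch

    class RolloutEngine(Protocol):
        def generate(self, prompts: Sequence[str],
                     sampling_params: dict) -> list[dict]:
            """Run rollouts; each result carries token ids and per-token logprobs."""
            ...

        def update_weights(self, named_tensors: dict[str, torch.Tensor]) -> None:
            """Sync fresh training weights into the inference engine."""
            ...

        def compute_logprobs(self, token_ids: list[list[int]]) -> list[list[float]]:
            """Re-score fixed sequences so trainers can check rollout/train parity."""
            ...
    ```

    Pinning these three surfaces is what lets slime, verl, and AReaL track one contract instead of each carrying its own adapter.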

Multi-LoRA Serving

TODO

Model Coverage

CI / Release / Maintenance



TODO: This is still WIP. More sections will be added.
