OSS Model Performance Optimization
-
Objective: Improve mainstream OSS model performance for developers on the latest Nvidia hardware.
-
Note: Agg means aggregated serving (prefill and decode on the same worker); Disagg means prefill/decode (PD) disaggregation.
-
Qwen3.5-397B (G)B200/(G)B300/Hopper
- April: Agg Round 1 — GDN kernel, MNNVL All Reduce, spec_V2 MTP; Disagg functional
- May: Disagg sweeps + Round 2
- June: Agg/Disagg Round 3
-
gpt-oss-120B
- April: Agg Round 2 — kernel fusions, communication kernels, memcpy removal
-
DeepSeek (G)B200/(G)B300
- April: Updated rate-matching sweeps; new sweeps with MTP on/off
- May: DeepSeek V3.2 long context Round 2
- June: DeepSeek V3.2 long context Round 3
-
Nemotron V3 (G)B200/Hopper/Spark
- April: async scheduling + prefix caching; Testing refactor; Return intermediate state
- May: Disagg tuning/sweeps/kernels
- June: Ultra support
-
GLM-5 (G)B200/(G)B300/Hopper
- April: FP8/NVFP4 functional; Agg Round 1
- May: Disagg sweeps + Round 2
- June: Agg/Disagg Round 3
-
MiniMax-M2.5
- April: FP8+NVFP4 Agg functional; Agg gap analysis
- May: Agg Round 1
-
Qwen3-Coder-480B
- April: FP8+NVFP4 Agg functional
-
GLM-4.7
- April: Agg functional; Agg gap analysis
- May: Agg Round 1
Dynamo
-
K8s + Planner
- Expose deeper scheduler-level forward-pass metrics so customers can use the Planner for engine tuning and performance
- Enable SGLang with Dynamo K8s + Grove for large-scale GB200/GB300 deployment
-
Agentic Optimizations
-
HiCache
- Enable Dynamo router to route to workers based on KV cache at multiple tiers
- Selective prefetch/evict APIs for more granular control
- Recipes for deploying Dynamo with shared KV cache tier
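
To illustrate what routing on KV cache at multiple tiers could look like, here is a minimal Python sketch. Everything in it is hypothetical: the tier names, the `TIER_WEIGHT` values, the `Worker` class, and the `pick_worker` function are assumptions made for illustration and are not the Dynamo router or HiCache API. It scores each worker by the longest request prefix it already holds, weighting blocks by how cheap their tier is to reuse, and subtracts a simple load penalty.

```python
from dataclasses import dataclass, field

# Hypothetical reuse value per tier (higher = cheaper to reuse). Illustrative only.
TIER_WEIGHT = {"gpu": 1.0, "host": 0.6, "shared": 0.3}


@dataclass
class Worker:
    name: str
    # tier name -> set of KV block hashes cached on that tier (assumed layout)
    cached: dict = field(default_factory=dict)
    active_requests: int = 0


def prefix_score(worker, block_hashes):
    """Sum tier weights over the longest cached prefix of the request."""
    score = 0.0
    for h in block_hashes:
        # Tiers are checked from fastest to slowest (dict insertion order).
        tier = next((t for t in TIER_WEIGHT if h in worker.cached.get(t, set())), None)
        if tier is None:
            break  # prefix reuse stops at the first missing block
        score += TIER_WEIGHT[tier]
    return score


def pick_worker(workers, block_hashes, load_penalty=0.1):
    """Pick the worker with the best cache score after a per-request load penalty."""
    return max(
        workers,
        key=lambda w: prefix_score(w, block_hashes) - load_penalty * w.active_requests,
    )


workers = [
    Worker("a", {"gpu": {1, 2}}),
    Worker("b", {"host": {1, 2, 3}}),
    Worker("c"),
]
chosen = pick_worker(workers, [1, 2, 3])
```

With these weights, a worker holding a shorter prefix on GPU can still beat one holding a longer prefix on a slower tier; a real router would need measured transfer costs rather than fixed constants.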
-
More blog posts!
Documentation / Recipes / Blogs
- Cookbooks/Docs: Better documentation for the following models: Qwen3.5/gpt-oss/GLM-5/MiniMax-M2.5/Qwen3-Coder/GLM-4.7
Miles — RadixArk RL Framework
-
Enablement & Stability: Collaborating with SGLang to bring up GB300 (possibly GB200) support in main branch; ensure key workflows function correctly (CI/CD validation + feature coverage).
-
Performance Track & Optimization: Establish perf baselines on GB300, continuously track regressions, and iterate on fixes after initial bring-up.
Runtime Optimizations
Objective: Incorporate and improve state-of-the-art runtime features to benefit all SGLang developers.
Context parallel
- Roadmap: [Roadmap] Context Parallelism (2026 Q2) #21788
GB200/GB300 optimization
- Issue: [Roadmap] GB200/GB300 development for Q2 #19650
Agentic Workload
- Issue: [Roadmap]: SGLang Distributed KVCache System For Agentic Workload #21846
Communication
Video/Image Decode/Pre-process
Enabling SGLang on Rubin systems
CI/CD/Dependencies
- Roadmap: [Tracking] SGLang CI/CD Test Coverage Improvements - Q2 2026 Roadmap #20847
- Issue: [Feature] Improve Unit Test Coverage #20865
- Roadmap: [Feature] Upgrade default Cuda version to 13.0 #21498