Nvidia Collaboration Roadmap (2026 Q2) #22960

## OSS Model Performance Optimization
- *Objective: Improve mainstream OSS model performance for developers on the latest Nvidia hardware.*
- Note: `Agg` means aggregated serving (prefill and decode on the same worker); `Disagg` means prefill/decode (PD) disaggregation.
- **Qwen3.5-397B** (G)B200/(G)B300/Hopper
  - April: Agg Round 1 — GDN kernel, MNNVL All Reduce, spec_V2 MTP; Disagg functional
  - May: Disagg sweeps + Round 2
  - June: Agg/Disagg Round 3

- **gpt-oss-120B**
  - April: Agg Round 2 — kernel fusions, communication kernels, memcpy removal

- **DeepSeek** (G)B200/(G)B300
  - April: Updated rate-matching sweeps; new sweeps with MTP on/off
  - May: DeepSeek V3.2 long context Round 2
  - June: DeepSeek V3.2 long context Round 3

- **Nemotron V3** (G)B200/Hopper/Spark
  - April: Async scheduling + prefix caching; testing refactor; return intermediate state
  - May: Disagg tuning/sweeps/kernels
  - June: Ultra support

- **GLM-5** (G)B200/(G)B300/Hopper
  - April: FP8/NVFP4 Functional; Agg Round 1
  - May: Disagg sweeps + Round 2
  - June: Agg/Disagg Round 3

- **Minimax-M2.5**
  - April: FP8+NVFP4 Agg functional; Agg gap analysis
  - May: Agg Round 1

- **Qwen3-Coder-480B**
  - April: FP8+NVFP4 Agg functional

- **GLM-4.7**
  - April: Agg functional; Agg gap analysis
  - May: Agg Round 1


## Runtime Optimizations

*Objective: Incorporate and improve state-of-the-art runtime features to benefit all SGLang developers.*

- **Context parallel**
  Roadmap: https://github.com/sgl-project/sglang/issues/21788
  - **Helix Parallelism**
    - April: Add decode Context Parallelism support
    - May: Core Helix MLA+GQA support
    - June: Extend with A2A optimizations

- **GB200/GB300 optimization**
  Issue: https://github.com/sgl-project/sglang/issues/19650

- **Agentic Workload**
  Issue: https://github.com/sgl-project/sglang/issues/21846

- **Communication**
  - Support NCCL-EP all-to-all backend
  - Shift DeepEP to hybrid_ep branch
  - Benchmark trtllm communication kernels and optionally make them the default

- **Video/Image Decode/Pre-process**
  - April: Video Decode GPU acceleration
  - May: Image Decode GPU Acceleration
  - June: Decode/Pre-compute GPU acceleration

- Enabling SGLang on Rubin systems


## CI/CD/Dependencies

- **CI/CD Test Coverage Improvements**
  Roadmap: https://github.com/sgl-project/sglang/issues/20847
  - Full E2E + Disagg
    - April: First test: DSR1 FP8 1P1D DEP8
    - May: Expand features/configs
    - June: Expand to more models + transfer layers
  - Full E2E + Agg
    - April: Key models: DSR1, Qwen3.5, GLM5
    - May: More models: Minimax-M2.5, Qwen3-Coder
    - June: TBD
  - Reduced Layers
    - April: PoC: DSR1 (4 layers) on H200/B200
    - May: Expand: 3 models x many configs
    - June: More models
  - Kernel Tests
  - Enhance test coverage on GB200/GB300 (including wideep tests)
- **Improve unit test coverage**
  Issue: https://github.com/sgl-project/sglang/issues/20865
- **Set CUDA 13.1 as the default CUDA version** (depending on Torch 2.11)
  Roadmap: https://github.com/sgl-project/sglang/issues/21498

## Dynamo

- **K8s + Planner**
  - Enable deeper scheduler-level forward-pass metrics so customers can use the planner for optimal engine tuning and performance
  - Enable SGLang with Dynamo K8s + Grove for large-scale GB200/GB300 deployment

- **Agentic Optimizations**
  - KV cache manipulation for subagents (Agentic Workload roadmap: https://github.com/sgl-project/sglang/issues/21846)
  - Decode-side radix cache

- **HiCache**
  - Enable Dynamo router to route to workers based on KV cache at multiple tiers
  - Selective prefetch/evict APIs for more granular control
  - Recipes for deploying Dynamo with shared KV cache tier

- **More blog posts!**

## Documentation / Recipes / Blogs

- **Cookbooks/Docs**: Better documentation for the following models: Qwen3.5/gpt-oss/GLM-5/MiniMax-M2.5/Qwen3-Coder/GLM-4.7


## Miles — RadixArk RL Framework

- **Enablement & Stability**: Collaborate with SGLang to bring up GB300 (and possibly GB200) support in the main branch; ensure key workflows function correctly (CI/CD validation and feature coverage).

- **Performance Track & Optimization**: Establish perf baselines on GB300, continuously track regressions, and iterate on fixes after initial bring-up.


