AMD SGLang Roadmap — 2026 Q2
Contributions and feedback are welcome. Join AMD's Slack channel.
Focus
Upstream feature compatibility & reliability: full compatibility and production-level reliability across P/D disaggregation, all parallelisms, speculative decoding, hierarchical cache, and load balancing on AMD Instinct GPUs (MI30x, MI35x, and next-gen).
Usability: easy installation on AMD; simple large-scale deployment (Kubernetes, OME).
Kernel optimization: pursue better performance and higher feature velocity on current and next-gen hardware through Triton, TileLang, FlyDSL, and HipKittens.
Reinforcement learning: framework integration and training-inference optimizations with miles and slime.
Multimodal: enhance diffusion models for video, image, and 3D generation; enable support for more data types while balancing performance and generation quality; support omni models.
Feature and Performance Improvements
General features and performance
PoC: @HaiShaw
Goals: Add support for new models and new AI operators. Extend fully distributed inference to all workloads, on top of full support for new features and enablement, with competitive performance as defined in the general roadmap. Development Roadmap (2026 Q2) #22949
Distributed Inference
PoC: @HaiShaw @1am9trash @sogalin @kkHuang-amd @Lzy17
Goal: Distributed inference (DI) for all major models (DPSKvX, Qwen3.X, GLM-5X, Kimi-K2.X, MiniMax-M2.X, GPT-OSS) on Helios, multi-node.
Speculative Decoding
PoC: @sogalin @kkHuang-amd @hubertlu-tw @1am9trash
Goal: Speculative decoding (SD) / multi-token prediction (MTP) for all major models (DPSKvX, Qwen3.X, GLM-5X, Kimi-K2.X, MiniMax-M2.X, GPT-OSS) on ROCm.
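For intuition, the core loop of speculative decoding — draft k tokens with a cheap model, then verify them against the target model, keeping the accepted prefix plus one correction token — can be sketched as below. This is a toy greedy-acceptance model for illustration only, not SGLang's MTP implementation; `draft_next` and `target_next` are hypothetical stand-ins for the draft and target models.

```python
# Toy sketch of the speculative-decoding draft/verify loop.
# Not SGLang's implementation; models are hypothetical toys.

def draft_next(ctx):
    # Toy draft model: usually agrees with the target,
    # but drifts on every 4th position.
    return ctx[-1] + 1 if len(ctx) % 4 else ctx[-1] + 2

def target_next(ctx):
    # Toy target model: always emits previous token + 1.
    return ctx[-1] + 1

def speculative_step(ctx, k=4):
    """Draft k tokens, then verify them against the target model.

    The accepted prefix comes from the draft; the first mismatch is
    replaced by the target's own token, so every step emits at least
    one token and the output matches pure target-model decoding.
    """
    drafted, tmp = [], list(ctx)
    for _ in range(k):
        t = draft_next(tmp)
        drafted.append(t)
        tmp.append(t)

    accepted, verify_ctx = [], list(ctx)
    for t in drafted:
        expected = target_next(verify_ctx)  # batched into one pass in practice
        if t == expected:
            accepted.append(t)
            verify_ctx.append(t)
        else:
            accepted.append(expected)       # correction token from the target
            break
    return ctx + accepted
```

The payoff is that accepted draft tokens cost one target-model forward pass per step instead of one per token, while the output stays identical to greedy target decoding.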
Piecewise CUDA graphs
PoC: @hubertlu-tw
Goals: Add support for piecewise CUDA graphs on AMD GPUs to reduce kernel-launch overhead during the prefill phase. [AMD] Enable Piecewise CUDA Graph for AMD GPUs #22299
Attention Kernels & MxFP4 KV Cache
PoC: @1am9trash @RolaoDenthu @kkHuang-amd
Goals: Improve attention performance with bounded accuracy loss, using FP8/MxFP4 dtypes for matmul and the KV cache.
Optimize FP8 attention kernels for NSA/DSA models, including improving the existing TileLang implementation, potential rewrites in Triton or FlyDSL, and subsequent configuration tuning for better performance.
Explore MxFP4 KV cache support for NSA/DSA models to reduce memory footprint.
Explore MxFP4 attention kernels for NSA/DSA models to improve inference throughput with minimal accuracy loss.
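For intuition on why an MxFP4 KV cache shrinks memory: the OCP MX format stores blocks of 32 FP4 (E2M1) elements sharing one 8-bit power-of-two scale, so a block costs 32×4 + 8 = 136 bits versus 512 bits in FP16. A minimal pure-Python sketch of that block quantization follows — illustrative only, with a simplified scale-selection rule rather than the spec's exact exponent formula, and nothing like a production kernel.

```python
# Hedged sketch of MX-style FP4 block quantization (OCP MX spirit:
# 32-element blocks, shared power-of-two scale, FP4 E2M1 elements).
import math

E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 E2M1 magnitudes
E2M1_MAX = 6.0
BLOCK = 32  # elements per shared scale

def quantize_block(xs):
    """Return (shared power-of-two scale, list of signed FP4 values)."""
    assert len(xs) <= BLOCK
    amax = max(abs(x) for x in xs)
    if amax == 0.0:
        return 1.0, [0.0] * len(xs)
    # Smallest power-of-two scale with amax/scale <= E2M1_MAX
    # (simplified relative to the spec's exponent rule).
    scale = 2.0 ** math.ceil(math.log2(amax / E2M1_MAX))
    out = []
    for x in xs:
        m = min(abs(x) / scale, E2M1_MAX)
        q = min(E2M1, key=lambda v: abs(v - m))  # round to nearest FP4 value
        out.append(math.copysign(q, x) if q else 0.0)
    return scale, out

def dequantize_block(scale, qs):
    return [q * scale for q in qs]
```

Values that land exactly on a representable grid point round-trip losslessly; everything else carries a bounded rounding error, which is the accuracy/footprint trade-off the roadmap items above aim to characterize.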
Context parallelism
PoC: @kkHuang-amd @hubertlu-tw
Goal: Context Parallelism (CP) on ROCm
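Context parallelism shards a long sequence across devices, which works because softmax attention can be accumulated chunk-by-chunk with a running max and denominator — the same online-softmax trick behind ring attention. A toy single-query, scalar-value sketch, illustrative only and nothing like an SGLang kernel:

```python
# Online-softmax accumulation over KV chunks, the building block that
# lets each context-parallel rank process its own sequence shard.
import math

def attn_full(q, ks, vs):
    """Reference: softmax(q.k) weighted sum over the whole sequence."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in ks]
    m = max(scores)
    ws = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(ws, vs)) / sum(ws)

def attn_chunked(q, ks, vs, chunk=2):
    """Same result, but each KV chunk is processed independently and
    merged into a running (max, denominator, numerator) state, as a
    CP rank would before the final cross-rank reduction."""
    m, z, acc = float("-inf"), 0.0, 0.0
    for i in range(0, len(ks), chunk):
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in ks[i:i + chunk]]
        m_new = max(m, max(scores))
        corr = math.exp(m - m_new) if m != float("-inf") else 0.0
        z = z * corr + sum(math.exp(s - m_new) for s in scores)
        acc = acc * corr + sum(math.exp(s - m_new) * v
                               for s, v in zip(scores, vs[i:i + chunk]))
        m = m_new
    return acc / z
```

Because the rescaling by `corr` keeps the partial numerator and denominator consistent under a growing running max, the chunked result matches the full computation exactly, regardless of chunk boundaries.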
Multimodal
PoC: @yctseng0211 @yichiche
Issue: FP8 MHA for diffusion models.
Upstream CI
PoC: @yctseng0211 @bingxche @michaelzhang-ai
Goal: Improve coverage, pass rate, and fix turnaround time.
Enhance the current AMD CI bot for improved daily reporting and analysis, moving from per-job failure triage to a true daily quality dashboard for the AMD CI fleet. Enhance AMD CI job monitoring for better runner/job status and queue-time analysis.
Kernel fusion & Optimizations
PoC: @hubertlu-tw @kkHuang-amd @1am9trash @RolaoDenthu @yctseng0211
Goals: Performance gains through kernel fusion and related optimizations.
Quantization and Quark
PoC: @HaiShaw @kkHuang-amd @1am9trash @BowenBao
Goals: Performance improvement with assured accuracy.
Enable higher performance for SOTA models with Quark quantization applied to specific layers/ops, dynamic weight quantization at load time, and NVFP4-to-MxFP4 weight conversion (dequantize and requantize) for model interoperability.
Quantization for diffusion models; Sage Attention with MxFP4.
Documentation & Recipes
PoC: @sogalin @Lzy17
Goals: Maintain clear and up-to-date documentation, usage recipes, utility scripts and READMEs for DI and all features on AMD.
TODO: This is still WIP. More sections will be added.