AMD SGLang Roadmap — 2026 Q2
Contributions and feedback are welcome. Join AMD's Slack channel.
Focus
Upstream feature compatibility & reliability: full compatibility and production-level reliability across P/D disaggregation, all parallelisms, speculative decoding, hierarchical cache, and load balancing on AMD Instinct GPUs (MI30x, MI35x, and next-gen).
Usability: easy installation on AMD; simple large-scale deployment (Kubernetes, OME).
Kernel optimization: pursue better performance and higher feature velocity on current and next-gen hardware through Triton, TileLang, FlyDSL, and HipKittens.
Reinforcement learning: framework integration and training-inference optimizations with miles and slime.
Multimodal: enhance diffusion models for video, image, and 3D generation; enable support for more data types while balancing performance and generation quality; support omni models.
Feature and Performance Improvements
General features and performance
PoC: @HaiShaw
Goals: Add support for new models and new AI operators. Extend fully distributed inference to all workloads, on top of full support for new features and enablement, with competitive performance as defined in the general roadmap. Development Roadmap (2026 Q2) #22949
Distributed Inference
PoC: @HaiShaw @1am9trash @sogalin @kkHuang-amd @Lzy17
Goal: Distributed inference (DI) for all major models (DPSKvX, Qwen3.X, GLM-5X, Kimi-K2.X, MiniMax-M2.X, GPT-OSS) on Helios, multi-node.
Speculative Decoding
PoC: @sogalin @kkHuang-amd @hubertlu-tw @1am9trash
Goal: Speculative decoding (SD) / multi-token prediction (MTP) for all major models (DPSKvX, Qwen3.X, GLM-5X, Kimi-K2.X, MiniMax-M2.X, GPT-OSS) on ROCm.
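For intuition, the core loop of speculative decoding — draft k tokens with a cheap model, then verify them against the target model, keeping the accepted prefix plus one correction token — can be sketched as below. This is a toy greedy-acceptance model for illustration only, not SGLang's MTP implementation; `draft_next` and `target_next` are hypothetical stand-ins for the draft and target models.

```python
# Toy sketch of the speculative-decoding draft/verify loop.
# Not SGLang's implementation; models are hypothetical toys.

def draft_next(ctx):
    # Toy draft model: usually agrees with the target,
    # but drifts on every 4th position.
    return ctx[-1] + 1 if len(ctx) % 4 else ctx[-1] + 2

def target_next(ctx):
    # Toy target model: always emits previous token + 1.
    return ctx[-1] + 1

def speculative_step(ctx, k=4):
    """Draft k tokens, then verify them against the target model.

    The accepted prefix comes from the draft; the first mismatch is
    replaced by the target's own token, so every step emits at least
    one token and the output matches pure target-model decoding.
    """
    drafted, tmp = [], list(ctx)
    for _ in range(k):
        t = draft_next(tmp)
        drafted.append(t)
        tmp.append(t)

    accepted, verify_ctx = [], list(ctx)
    for t in drafted:
        expected = target_next(verify_ctx)  # batched into one pass in practice
        if t == expected:
            accepted.append(t)
            verify_ctx.append(t)
        else:
            accepted.append(expected)       # correction token from the target
            break
    return ctx + accepted
```

The payoff is that accepted draft tokens cost one target-model forward pass per step instead of one per token, while the output stays identical to greedy target decoding.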
Piecewise CUDA graphs
PoC: @hubertlu-tw
Goals: Add support for piecewise CUDA graphs on AMD GPUs to reduce kernel-launch overhead during the prefill phase. [AMD] Enable Piecewise CUDA Graph for AMD GPUs #22299
Attention Kernels & MxFP4 KV Cache
PoC: @1am9trash @RolaoDenthu @kkHuang-amd
Goals: Improve attention performance with bounded accuracy loss, using FP8/MxFP4 dtypes for matmul and the KV cache.
Optimize FP8 attention kernels for NSA/DSA models, including improving the existing TileLang implementation, potential rewrites in Triton or FlyDSL, and subsequent configuration tuning for better performance.
Explore MxFP4 KV cache support for NSA/DSA models to reduce memory footprint.
Explore MxFP4 attention kernels for NSA/DSA models to improve inference throughput with minimal accuracy loss.
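For intuition on why an MxFP4 KV cache shrinks memory: the OCP MX format stores blocks of 32 FP4 (E2M1) elements sharing one 8-bit power-of-two scale, so a block costs 32×4 + 8 = 136 bits versus 512 bits in FP16. A minimal pure-Python sketch of that block quantization follows — illustrative only, with a simplified scale-selection rule rather than the spec's exact exponent formula, and nothing like a production kernel.

```python
# Hedged sketch of MX-style FP4 block quantization (OCP MX spirit:
# 32-element blocks, shared power-of-two scale, FP4 E2M1 elements).
import math

E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 E2M1 magnitudes
E2M1_MAX = 6.0
BLOCK = 32  # elements per shared scale

def quantize_block(xs):
    """Return (shared power-of-two scale, list of signed FP4 values)."""
    assert len(xs) <= BLOCK
    amax = max(abs(x) for x in xs)
    if amax == 0.0:
        return 1.0, [0.0] * len(xs)
    # Smallest power-of-two scale with amax/scale <= E2M1_MAX
    # (simplified relative to the spec's exponent rule).
    scale = 2.0 ** math.ceil(math.log2(amax / E2M1_MAX))
    out = []
    for x in xs:
        m = min(abs(x) / scale, E2M1_MAX)
        q = min(E2M1, key=lambda v: abs(v - m))  # round to nearest FP4 value
        out.append(math.copysign(q, x) if q else 0.0)
    return scale, out

def dequantize_block(scale, qs):
    return [q * scale for q in qs]
```

Values that land exactly on a representable grid point round-trip losslessly; everything else carries a bounded rounding error, which is the accuracy/footprint trade-off the roadmap items above aim to characterize.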
Context parallelism
PoC: @kkHuang-amd @hubertlu-tw
Goal: Context Parallelism (CP) on ROCm
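Context parallelism shards a long sequence across devices, which works because softmax attention can be accumulated chunk-by-chunk with a running max and denominator — the same online-softmax trick behind ring attention. A toy single-query, scalar-value sketch, illustrative only and nothing like an SGLang kernel:

```python
# Online-softmax accumulation over KV chunks, the building block that
# lets each context-parallel rank process its own sequence shard.
import math

def attn_full(q, ks, vs):
    """Reference: softmax(q.k) weighted sum over the whole sequence."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in ks]
    m = max(scores)
    ws = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(ws, vs)) / sum(ws)

def attn_chunked(q, ks, vs, chunk=2):
    """Same result, but each KV chunk is processed independently and
    merged into a running (max, denominator, numerator) state, as a
    CP rank would before the final cross-rank reduction."""
    m, z, acc = float("-inf"), 0.0, 0.0
    for i in range(0, len(ks), chunk):
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in ks[i:i + chunk]]
        m_new = max(m, max(scores))
        corr = math.exp(m - m_new) if m != float("-inf") else 0.0
        z = z * corr + sum(math.exp(s - m_new) for s in scores)
        acc = acc * corr + sum(math.exp(s - m_new) * v
                               for s, v in zip(scores, vs[i:i + chunk]))
        m = m_new
    return acc / z
```

Because the rescaling by `corr` keeps the partial numerator and denominator consistent under a growing running max, the chunked result matches the full computation exactly, regardless of chunk boundaries.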
Multimodal
PoC: @yctseng0211 @yichiche
Issue: FP8 MHA for diffusion models.
Upstream CI
PoC: @yctseng0211 @bingxche @michaelzhang-ai
Goal: Improve coverage, pass rate, and fix turnaround time.
Enhance the current AMD CI bot for improved daily reporting and analysis, moving from per-job failure triage to a true daily quality dashboard for the AMD CI fleet. Enhance AMD CI job monitoring for better runner/job status and queue-time analysis.
Kernel fusion & Optimizations
PoC: @hubertlu-tw @kkHuang-amd @1am9trash @RolaoDenthu @yctseng0211
Goals: Performance gains through kernel fusion and related optimizations.
Quantization and Quark
PoC: @HaiShaw @kkHuang-amd @1am9trash @BowenBao
Goals: Performance improvement with assured accuracy.
Enable higher performance for SOTA models with Quark quantization applied to specific layers/ops, dynamic weight quantization at load time, and NVFP4-to-MxFP4 weight conversion (dequantize and requantize) for model interoperability.
Quantization for diffusion models; Sage Attention with MxFP4.
Documentation & Recipes
PoC: @sogalin @Lzy17
Goals: Maintain clear and up-to-date documentation, usage recipes, utility scripts and READMEs for DI and all features on AMD.
TODO: This is still WIP. More sections will be added.