Motivation
Previously, we added an optimized CPU backend to SGLang for Xeon with AMX support, enabled Graph Mode with torch.compile, and extended model coverage.
In 2025Q4, we will continue optimizing CPU backend performance, focusing primarily on production deployment:
Deployment of small- to medium-sized LLMs, e.g. MoE models with fewer than 5B activated parameters, such as Qwen3-Next-80B-A3B.
Deployment of OCR models (DeepSeek-OCR), ASR models (Whisper), and multimodal models.
General Optimizations
Graph Mode Improvement: combine pre-compiled batch sizes with explicit user-configured inputs to allow more flexible use of the graph runner and improve overall throughput. @CaoE
Causal Conv1d Support: add optimized kernels for Mamba attention, causal conv1d, and flash linear attention, to support Qwen3-Next. @mingfeima
MXFP4 Support: out-of-the-box support for MXFP4 with weight-only quantization: dequantize MXFP4 -> BF16 and compute with AMX-BF16 or AVX512-BF16, to support GPT-OSS 20B and 120B. @mingfeima
FP8 KV Cache Support: enable FP8 KV cache, falling back to BF16 for compute. @blzheng
Data Parallel Attention: enable data-parallel attention for MLA. @chunyuan-w
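To make the MXFP4 weight-only path concrete, here is a minimal pure-Python sketch of dequantizing one MXFP4 block (32 FP4/E2M1 values sharing an E8M0 scale) to floating point. This is only an illustration of the format, not SGLang's actual AMX/AVX512 kernel; the function name and layout are assumptions.

```python
# Hedged sketch of MXFP4 weight-only dequantization (illustrative only,
# not the SGLang kernel). One MXFP4 block = 32 FP4 (E2M1) codes plus one
# shared E8M0 exponent byte; scale = 2**(shared_exp - 127).

# Magnitude table for the low 3 bits of an E2M1 code.
_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def dequant_mxfp4_block(nibbles, shared_exp):
    """Dequantize one 32-element MXFP4 block to Python floats.

    nibbles: 32 ints in [0, 15]; bit 3 is the sign, bits 2..0 index _E2M1.
    shared_exp: the block's E8M0 scale byte.
    """
    scale = 2.0 ** (shared_exp - 127)
    out = []
    for code in nibbles:
        sign = -1.0 if code & 0x8 else 1.0
        out.append(sign * _E2M1[code & 0x7] * scale)
    return out
```

In the planned BF16 compute path, the decoded values would be rounded to BF16 and fed to AMX-BF16 or AVX512-BF16 tiles rather than kept as Python floats.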
Innovation
Software pipelining for AMX / AVX512: double buffering for dequantization and dot product with AMX / AVX512, increasing achieved FLOPS for FP8, MXFP4, and INT4 GEMM and MoE.
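The double-buffering idea above can be sketched as a pipelined loop: while tile i is consumed by the dot product, tile i+1 is dequantized into the other buffer. This is a structural sketch only (the helper names are hypothetical); on real hardware the dequant and the AMX/AVX512 dot product overlap in time, which plain Python cannot show.

```python
# Sketch of software pipelining with double buffering (illustrative).
# Two buffers alternate: one is consumed by the dot product while the
# next tile is dequantized into the other.
def pipelined_matvec(quant_tiles, dequant, dot, x_tiles):
    """Accumulate sum_i dot(dequant(quant_tiles[i]), x_tiles[i]).

    dequant and dot are caller-supplied stand-ins for the real kernels.
    """
    if not quant_tiles:
        return 0.0
    # Prologue: fill buffer 0 with the first dequantized tile.
    buf = [dequant(quant_tiles[0]), None]
    acc = 0.0
    for i in range(len(quant_tiles)):
        cur = buf[i % 2]
        # "Prefetch" the next tile into the other buffer; on hardware
        # this dequant overlaps with the dot product below.
        if i + 1 < len(quant_tiles):
            buf[(i + 1) % 2] = dequant(quant_tiles[i + 1])
        acc += dot(cur, x_tiles[i])
    return acc
```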
User Experience and Testing Enhancement
Documentation: complete the documentation and provide BKMs (best-known methods) for optimal configurations of prioritized models. @ZailiWang
Bug Tracking: track bugs and enable more proxy models in test cases. @1pikachu
Xeon CI: maintain CI stability and enhance UTs by increasing test case coverage and pass rate. @1pikachu
Checklist