1. Quantization refactor
Background
Scheme structure
Currently, quantization methods are mostly implemented in two ways: with a scheme structure, like compressed_tensors and quark, and without one, like modelslim for NPU or AWQ. Forgoing the scheme structure can be acceptable when only one format is loaded, but loading many formats without format-specific scheme logic overloads the get_quant_method function. Overall, the more quant methods are supported, the larger each respective file grows.
Weight loading and inference
The weight creation and inference code is currently implemented in the same class, even though the same inference code could be utilized by different frameworks.
Motivation
Support more scheme structures
The key benefit of following the scheme structure is maintainability: schemes are much easier to maintain and update, which simplifies both implementation and review.
Below is an example of the proposed scheme-structure change:
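The referenced diagram is not reproduced in this text version. As a rough, hedged illustration (class and method names below are hypothetical, modeled loosely on how compressed-tensors organizes its schemes, not actual SGLang APIs), a scheme-based layout could look like:

```python
from abc import ABC, abstractmethod


class QuantizationScheme(ABC):
    """One scheme = one (format, layer-type) strategy, kept in its own file."""

    @abstractmethod
    def create_weights(self, layer, input_size: int, output_size: int) -> None:
        """Register the quantized weight tensors on the layer."""

    @abstractmethod
    def apply(self, layer, x):
        """Run the quantized matmul for this scheme."""


class W4A16Scheme(QuantizationScheme):
    def create_weights(self, layer, input_size, output_size):
        ...  # would allocate packed int4 weights, scales, zeros

    def apply(self, layer, x):
        ...  # would call the w4a16 kernel


class ExampleQuantConfig:
    """Hypothetical config: maps a layer to a scheme object instead of
    branching inside one large get_quant_method function."""

    def get_scheme(self, layer) -> QuantizationScheme:
        # Dispatch on checkpoint metadata; each new format only adds a
        # scheme class plus one dispatch entry here, instead of growing
        # a single monolithic file.
        return W4A16Scheme()
```

With this shape, adding a new format touches only its own scheme file and one dispatch entry, which is the maintainability win described above.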
Split weight loading and inference
Quant config, weight creation, and kernel-call logic should be clearly separated so that different frameworks can use the same kernel where it fits. This avoids code duplication, improves code readability, and prevents circular imports. Our end goal is a unified, simpler structure for quantization functionality. The main inspiration for these refactoring ideas is the compressed-tensors scheme structure.

Below are image examples of the proposed change for AWQ:
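The images are not included in this text version. As a minimal sketch of the intended split (all names below are hypothetical, and the kernel math is a toy placeholder, not the real fused AWQ GEMM), the idea is that the framework-side class owns only weight state, while inference goes through a free kernel function any framework can call:

```python
from dataclasses import dataclass
from typing import List


def dequantize(qweight: List[int], scale: float, zero: int) -> List[float]:
    # Placeholder dequant; real AWQ unpacks grouped int4 values.
    return [(q - zero) * scale for q in qweight]


def awq_matvec_kernel(x: List[float], qweight: List[int],
                      scale: float, zero: int) -> float:
    """Framework-agnostic kernel entry point (toy dot product here; a
    real implementation would invoke the fused AWQ GEMM kernel)."""
    w = dequantize(qweight, scale, zero)
    return sum(xi * wi for xi, wi in zip(x, w))


@dataclass
class AWQLinearMethod:
    """Framework-side class: owns only weight creation/loading state."""
    qweight: List[int]
    scale: float
    zero: int

    def apply(self, x: List[float]) -> float:
        # Inference delegates to the shared kernel function, so a
        # different framework can call awq_matvec_kernel directly
        # without importing this class.
        return awq_matvec_kernel(x, self.qweight, self.scale, self.zero)
```

Because the kernel takes plain tensors/arrays rather than the framework's layer object, reusing it elsewhere requires no import of the weight-loading class, which is what breaks the circular-import chains mentioned above.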
- NPU-specific refactoring and format support: [NPU] [Roadmap] NPU quantization 2026 Q1 Roadmap #14424
- Compressed-Tensors, ModelSlim, Quark MoE schemes (see sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py, line 668 at f8636fb)
- Additional scheme support
- Kernel call and weight init split
2. Non-Linear Module & Communication Quantization
Objective: Optimize components beyond standard linear layers to further improve performance.
- Attention
  - MLA Quantization @hammersam: [WIP] [NPU] FIA quant mode implementation #24695
  - Improved KV Cache Quantization
- Communication Quantization
3. New formats support
- NVFP4 Quantization support (Roadmap: SGLang Nvidia Collaboration Roadmap (2026 Q1) #17130)
- Improved AutoQuantize (Roadmap: SGLang Nvidia Collaboration Roadmap (2026 Q1) #17130)
- FP4 KV-Cache Support (Roadmap: SGLang Nvidia Collaboration Roadmap (2026 Q1) #17130)
- mxfp8 support @zianglih: Add mxfp8 support for online quantization, Triton dense linear, and CUTLASS MoE #17449
- Online Rotation (for FlatQuant, etc.)
- Vector quantization (for QuIP#, AQLM, VPTQ)