[Ascend]quantization: w4a4, compressed tensors, NZ for non-quantized MOE, Qwen3 MOE double memory consumption fix #11984

Closed

OrangeRedeng wants to merge 56 commits into sgl-project:main from OrangeRedeng:develop

Conversation


@OrangeRedeng OrangeRedeng commented Oct 22, 2025

Motivation

The latest versions of CANN and Torch-NPU add support for int4 matmuls and int4 quantization; switching to this lower-bit quantization reduces weight memory consumption by up to 2x compared to int8.
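
For intuition on the 2x figure: a packed int4 layout stores two weights per byte, so the weight footprint of a linear layer halves relative to int8. A minimal sketch with hypothetical layer shapes, not tied to any model in this PR:

out_features, in_features = 4096, 11008
int8_bytes = out_features * in_features       # 1 byte per weight
int4_bytes = out_features * in_features // 2  # 2 weights packed per byte
print(f"int8: {int8_bytes / 2**20:.1f} MiB, int4: {int4_bytes / 2**20:.1f} MiB")
# Per-channel fp16 scales add only out_features * 2 bytes on top.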

Converting weights from the ND layout to FRACTAL_NZ speeds up the GroupedMatmul kernel.
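
The cast is a one-time layout change at weight-load time. A minimal sketch, assuming torch_npu's npu_format_cast and ACL format id 29 for FRACTAL_NZ (the id is taken from the public ACL format enum; the tensor shape is illustrative only):

import torch
import torch_npu

# Hypothetical grouped expert weight, default ND layout after loading.
w = torch.randn(8, 1024, 2048, dtype=torch.bfloat16).npu()

# One-time cast to the blocked FRACTAL_NZ layout so GroupedMatmul can
# consume it directly; 29 = ACL_FORMAT_FRACTAL_NZ (assumption, see above).
w_nz = torch_npu.npu_format_cast(w, 29)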

Modifications

  1. Support dynamic per-channel (weight), per-token (activation) w4a4 quantization in SGLang; see the sketch after this list
  2. Support the compressed-tensors format on Ascend
  3. Add NZ conversion for non-quantized MOE models (5% speedup)
  4. Fix the double memory consumption bug for Qwen MOE models on Ascend
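
A minimal PyTorch sketch of the scale math behind item 1, for illustration only: weights get one static symmetric scale per output channel, activations get one dynamic symmetric scale per token. The real int4 packing and matmul happen inside the CANN/Torch-NPU kernels, so the emulation below stays in float:

import torch

def w4a4_dynamic_quant(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # x: [tokens, in_features] activations, w: [out_features, in_features]
    qmax = 7  # symmetric int4 range is [-8, 7]

    # Static per-output-channel weight scales, shape [out, 1]
    w_scale = (w.abs().amax(dim=1, keepdim=True) / qmax).clamp_min(1e-8)
    w_q = torch.clamp(torch.round(w / w_scale), -8, 7)

    # Dynamic per-token activation scales, computed on the fly, shape [tokens, 1]
    x_scale = (x.abs().amax(dim=1, keepdim=True) / qmax).clamp_min(1e-8)
    x_q = torch.clamp(torch.round(x / x_scale), -8, 7)

    # The NPU runs this as a packed-int4 matmul; float emulation here.
    return (x_q @ w_q.t()) * x_scale * w_scale.t()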

Accuracy Tests

W4A4:

Tested on Qwen3-8B quantized with the experimental msmodelslim tool:

ASCEND_RT_VISIBLE_DEVICES=4 STREAMS_PER_DEVICE=32 HCCL_OP_EXPANSION_MODE="AIV" AUTO_USE_UC_MEMORY=0 P2P_HCCL_BUFFSIZE=20 python -m sglang.launch_server --cuda-graph-bs 32 64 256 512 --device npu --attention-backend ascend --trust-remote-code --tp-size 1 --model-path ./msit_Qwen8B_mixed_w4a4_w8a8 --port 30088 --quantization w4a4_int4

(screenshot: accuracy results)

NZ:

Qwen3-30B, GSM8K dataset, A2 hardware, bf16, tp4:

  ND, cuda-graph-bs = 32 64 256 512: 94.5%
  NZ, cuda-graph-bs = 32 64 256 512: 94.4%
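
For reference, one common way to reproduce GSM8K numbers like these against a running server is SGLang's few-shot GSM8K script; the flags and port below are assumptions matching the launch command above:

python -m sglang.test.few_shot_gsm8k --num-questions 1319 --port 30088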

Benchmarking and Profiling

W4A4:

python -m sglang.bench_serving --backend sglang --random-range-ratio 1.0 --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name random --num-prompts 64 --max-concurrency 64 --random-input-len 2048 --random-output-len 2048 --host 127.0.0.1 --port 30088 --flush-cache

(screenshot: serving benchmark results)

NZ:

NZ vs ND acceleration ratio:
(chart: NZ vs ND acceleration)

Acceleration ranges from 1% to 12% across the benchmark.

Checklist

@OrangeRedeng OrangeRedeng changed the title Develop NPU w4a4, compressed tensors and NZ conversion support Oct 28, 2025
@OrangeRedeng OrangeRedeng marked this pull request as ready for review October 28, 2025 17:08

@ping1jing2 ping1jing2 self-assigned this Dec 2, 2025
@ping1jing2 ping1jing2 changed the title ASCEND NPU w4a4, Compressed tensors, NZ for MOE conversion, MOE double memory consumption fix [Ascend]quantization: w4a4, Compressed tensors, NZ for MOE conversion, MOE double memory consumption fix Dec 3, 2025
@ping1jing2 ping1jing2 marked this pull request as draft December 3, 2025 10:36
@ping1jing2 ping1jing2 removed the run-ci label Dec 3, 2025
@OrangeRedeng OrangeRedeng changed the title [Ascend]quantization: w4a4, Compressed tensors, NZ for MOE conversion, MOE double memory consumption fix [Ascend]quantization: w4a4, Compressed tensors, NZ for non-quantized MOE, Qwen MOE double memory consumption fix Dec 3, 2025
@OrangeRedeng OrangeRedeng changed the title [Ascend]quantization: w4a4, Compressed tensors, NZ for non-quantized MOE, Qwen MOE double memory consumption fix [Ascend]quantization: w4a4, compressed tensors, NZ for non-quantized MOE, Qwen MOE double memory consumption fix Dec 3, 2025
@OrangeRedeng OrangeRedeng changed the title [Ascend]quantization: w4a4, compressed tensors, NZ for non-quantized MOE, Qwen MOE double memory consumption fix [Ascend]quantization: w4a4, compressed tensors, NZ for non-quantized MOE, Qwen3 MOE double memory consumption fix Dec 3, 2025

Labels

documentation (Improvements or additions to documentation), npu, quant (LLM Quantization)
