[Ascend]quantization: w4a4, compressed tensors, NZ for non-quantized MOE, Qwen3 MOE double memory consumption fix #11984

Closed

OrangeRedeng wants to merge 56 commits into sgl-project:main from OrangeRedeng:develop

Conversation


@OrangeRedeng OrangeRedeng commented Oct 22, 2025

Motivation

The latest versions of CANN and Torch-NPU add support for int4 matmuls and int4 quantization; switching to this lower-bit quantization reduces weight memory consumption by up to 2x compared to int8.
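
For intuition on the 2x figure: a packed int4 layout stores two weights per byte, so the weight footprint of a linear layer halves relative to int8. A minimal sketch with hypothetical layer shapes, not tied to any model in this PR:

out_features, in_features = 4096, 11008
int8_bytes = out_features * in_features       # 1 byte per weight
int4_bytes = out_features * in_features // 2  # 2 weights packed per byte
print(f"int8: {int8_bytes / 2**20:.1f} MiB, int4: {int4_bytes / 2**20:.1f} MiB")
# Per-channel fp16 scales add only out_features * 2 bytes on top.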

Converting weights from the ND layout to FRACTAL_NZ speeds up the GroupedMatmul kernel.
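
The cast is a one-time layout change at weight-load time. A minimal sketch, assuming torch_npu's npu_format_cast and ACL format id 29 for FRACTAL_NZ (the id is taken from the public ACL format enum; the tensor shape is illustrative only):

import torch
import torch_npu

# Hypothetical grouped expert weight, default ND layout after loading.
w = torch.randn(8, 1024, 2048, dtype=torch.bfloat16).npu()

# One-time cast to the blocked FRACTAL_NZ layout so GroupedMatmul can
# consume it directly; 29 = ACL_FORMAT_FRACTAL_NZ (assumption, see above).
w_nz = torch_npu.npu_format_cast(w, 29)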

Modifications

  1. Support dynamic per-channel (weight), per-token (activation) w4a4 quantization in SGLang; see the sketch after this list
  2. Support the compressed-tensors format on Ascend
  3. Add NZ conversion for non-quantized MOE models (5% speedup)
  4. Fix the double memory consumption bug for Qwen MOE models on Ascend
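
A minimal PyTorch sketch of the scale math behind item 1, for illustration only: weights get one static symmetric scale per output channel, activations get one dynamic symmetric scale per token. The real int4 packing and matmul happen inside the CANN/Torch-NPU kernels, so the emulation below stays in float:

import torch

def w4a4_dynamic_quant(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # x: [tokens, in_features] activations, w: [out_features, in_features]
    qmax = 7  # symmetric int4 range is [-8, 7]

    # Static per-output-channel weight scales, shape [out, 1]
    w_scale = (w.abs().amax(dim=1, keepdim=True) / qmax).clamp_min(1e-8)
    w_q = torch.clamp(torch.round(w / w_scale), -8, 7)

    # Dynamic per-token activation scales, computed on the fly, shape [tokens, 1]
    x_scale = (x.abs().amax(dim=1, keepdim=True) / qmax).clamp_min(1e-8)
    x_q = torch.clamp(torch.round(x / x_scale), -8, 7)

    # The NPU runs this as a packed-int4 matmul; float emulation here.
    return (x_q @ w_q.t()) * x_scale * w_scale.t()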

Accuracy Tests

W4A4:

Tested on Qwen3-8B quantized with the experimental msmodelslim tool:

ASCEND_RT_VISIBLE_DEVICES=4 STREAMS_PER_DEVICE=32 HCCL_OP_EXPANSION_MODE="AIV" AUTO_USE_UC_MEMORY=0 P2P_HCCL_BUFFSIZE=20 python -m sglang.launch_server --cuda-graph-bs 32 64 256 512 --device npu --attention-backend ascend --trust-remote-code --tp-size 1 --model-path ./msit_Qwen8B_mixed_w4a4_w8a8 --port 30088 --quantization w4a4_int4

(screenshot: accuracy results)

NZ:

Qwen3-30B, GSM8K dataset, A2 hardware, bf16, tp4:

  ND, cuda-graph-bs = 32 64 256 512: 94.5%
  NZ, cuda-graph-bs = 32 64 256 512: 94.4%
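
For reference, one common way to reproduce GSM8K numbers like these against a running server is SGLang's few-shot GSM8K script; the flags and port below are assumptions matching the launch command above:

python -m sglang.test.few_shot_gsm8k --num-questions 1319 --port 30088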

Benchmarking and Profiling

W4A4:

python -m sglang.bench_serving --backend sglang --random-range-ratio 1.0 --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name random --num-prompts 64 --max-concurrency 64 --random-input-len 2048 --random-output-len 2048 --host 127.0.0.1 --port 30088 --flush-cache

(screenshot: serving benchmark results)

NZ:

NZ vs ND acceleration ratio:
(chart: NZ vs ND acceleration)

Acceleration ranges from 1% to 12% across the benchmark.

Checklist

@OrangeRedeng OrangeRedeng changed the title Develop NPU w4a4, compressed tensors and NZ conversion support Oct 28, 2025
@OrangeRedeng OrangeRedeng marked this pull request as ready for review October 28, 2025 17:08

@ping1jing2 ping1jing2 self-assigned this Dec 2, 2025
@ping1jing2 ping1jing2 changed the title ASCEND NPU w4a4, Compressed tensors, NZ for MOE conversion, MOE double memory consumption fix [Ascend]quantization: w4a4, Compressed tensors, NZ for MOE conversion, MOE double memory consumption fix Dec 3, 2025
@ping1jing2 ping1jing2 marked this pull request as draft December 3, 2025 10:36
@ping1jing2 ping1jing2 removed the run-ci label Dec 3, 2025
@OrangeRedeng OrangeRedeng changed the title [Ascend]quantization: w4a4, Compressed tensors, NZ for MOE conversion, MOE double memory consumption fix [Ascend]quantization: w4a4, Compressed tensors, NZ for non-quantized MOE, Qwen MOE double memory consumption fix Dec 3, 2025
@OrangeRedeng OrangeRedeng changed the title [Ascend]quantization: w4a4, Compressed tensors, NZ for non-quantized MOE, Qwen MOE double memory consumption fix [Ascend]quantization: w4a4, compressed tensors, NZ for non-quantized MOE, Qwen MOE double memory consumption fix Dec 3, 2025
@OrangeRedeng OrangeRedeng changed the title [Ascend]quantization: w4a4, compressed tensors, NZ for non-quantized MOE, Qwen MOE double memory consumption fix [Ascend]quantization: w4a4, compressed tensors, NZ for non-quantized MOE, Qwen3 MOE double memory consumption fix Dec 3, 2025

Labels

documentation (Improvements or additions to documentation), npu, quant (LLM Quantization)
