
[Nvidia] 1/N Modelopt Quantization Refactorization #20963

Open
wenscarl wants to merge 2 commits into sgl-project:main from wenscarl:modelopt_refactor

Conversation

@wenscarl
Collaborator

@wenscarl wenscarl commented Mar 20, 2026

Summary

Refactors ModelOpt quantization into a package layout aligned with the roadmap in #15194: a top-level config module, shared utils, and a schemes/ subtree for recipe-specific configs and methods.

This is a refactor only: intended behavior and the public surface of sglang.srt.layers.quantization.modelopt stay the same for normal use.
cc @Edwardf0t1, @Fridge003
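
For orientation, a sketch of the resulting package layout, reconstructed from the change list in this description (the exact file tree is an inference, not an authoritative listing):

```
python/sglang/srt/layers/quantization/modelopt/
├── __init__.py                     # package public API
├── modelopt.py                     # ModelOptQuantConfig
├── utils.py                        # NVFP4 padding / alignment helpers
└── schemes/
    ├── __init__.py                 # re-exports scheme symbols
    ├── modelopt_fp8.py             # FP8 config, linear, KV cache, MoE
    ├── modelopt_fp4.py             # FP4 bootstrap, config, linear, MoE
    └── modelopt_mixed_precision.py # mixed-precision config
```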

Motivation

  • Easier navigation and review (FP8 / FP4 / mixed precision live in dedicated scheme modules).
  • Structural parity with compressed_tensors (compressed_tensors.py + utils.py + schemes/).

What changed

| Area | Change |
| --- | --- |
| `modelopt/modelopt.py` | `ModelOptQuantConfig` only; lazy import for `ModelOptFp8KVCacheMethod` on the KV path to avoid import cycles. |
| `modelopt/utils.py` | NVFP4 padding / alignment helpers (`FP4_GEMM_ALIGNMENT`, `pad_nvfp4_*`, `slice_nvfp4_output`, …). |
| `modelopt/schemes/modelopt_fp8.py` | FP8 config, linear, KV cache, MoE. |
| `modelopt/schemes/modelopt_fp4.py` | FP4 bootstrap (`fp4_quantize`, `fp4_gemm`, …), FP4 config, linear, MoE. |
| `modelopt/schemes/modelopt_mixed_precision.py` | Mixed-precision config. |
| `modelopt/schemes/__init__.py` | Re-exports scheme symbols. |
| `modelopt/__init__.py` | Package public API; re-exports `fp4_gemm`, `fp4_quantize`, `enable_flashinfer_fp4_gemm` for CT NVFP4. |
| Monolithic `impl.py` | Removed (content moved, not rewritten). |
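
As an illustration of what the `utils.py` helpers do, here is a minimal sketch of alignment padding. The helper name, signature, and the alignment value 16 are assumptions for illustration only, not the actual sglang API:

```python
import math

# Assumed value for illustration; the real FP4_GEMM_ALIGNMENT lives in
# modelopt/utils.py and may differ.
FP4_GEMM_ALIGNMENT = 16

def pad_to_fp4_alignment(num_rows: int, alignment: int = FP4_GEMM_ALIGNMENT) -> int:
    """Round a row count up to the next multiple of the GEMM alignment."""
    return math.ceil(num_rows / alignment) * alignment
```

For example, a 100-row tensor would be padded to 112 rows before the FP4 GEMM, and `slice_nvfp4_output`-style helpers would trim the padding back off afterwards.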

Related import fix

  • compressed_tensors/schemes/compressed_tensors_w4a4_nvfp4.py now imports FP4 helpers from sglang.srt.layers.quantization.modelopt instead of the old modelopt_quant module path.
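
The lazy-import pattern used on the KV path can be sketched generically. In this toy version, the stdlib `json` module stands in for the scheme module, and the class and method names are illustrative, not the real sglang symbols:

```python
class QuantConfigSketch:
    """Toy stand-in for a config class that breaks an import cycle."""

    def get_kv_cache_method(self):
        # Deferred import: resolving the dependency at call time rather than
        # at module load time lets the scheme module import this module
        # without creating a circular import.
        from json import dumps  # stand-in for ModelOptFp8KVCacheMethod
        return dumps
```

By the time the method is called, all modules have finished importing, so the cycle never bites; at module load time, neither side needs the other.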

Non-goals

  • No new quantization recipes, kernels, or checkpoint formats.
  • No intentional breaking changes for code that already imports from sglang.srt.layers.quantization.modelopt.

How to test

  • Pre-commit / Ruff on touched paths.
  • Smoke: import sglang.srt.layers.quantization.modelopt and quantization registry entries for modelopt_fp8 / modelopt_fp4 / modelopt_mixed.
  • Optional: existing ModelOpt / NVFP4 / FP8 tests if available in CI.

Risk / rollback

Low: file split and import wiring only. Rollback is a single revert; no weight-format or config-schema changes intended.

Accuracy:

nvfp4:

```shell
python3 -m sglang.launch_server \
  --model-path nvidia/DeepSeek-R1-0528-FP4-v2 \
  --trust-remote-code \
  --disable-radix-cache \
  --kv-cache-dtype fp8_e4m3 \
  --max-running-requests 256 \
  --disable-cuda-graph \
  --chunked-prefill-size 2048 \
  --mem-fraction-static 0.89 \
  --max-prefill-tokens 16384 \
  --tp 4 \
  --dp 4 \
  --ep 4 \
  --enable-dp-attention \
  --quantization modelopt_fp4 \
  --moe-runner-backend flashinfer_cutlass  # or flashinfer_trtllm
```

GSM8k:

```shell
python3 benchmark/gsm8k/bench_sglang.py \
  --num-shots 8 \
  --num-questions 1316 \
  --parallel 1316
```

flashinfer_cutlass:

Accuracy: 0.960
Invalid: 0.000
Latency: 381.147 s
Output throughput: 371.731 token/s

flashinfer_trtllm:

Accuracy: 0.951
Invalid: 0.000
Latency: 403.069 s
Output throughput: 366.607 token/s

flashinfer_cutedsl (with `MOE_NVFP4_DISPATCH=1`, `DEEPEP_MAX_NUM_DISPATCH_TOKENS_PER_TOKEN=512`, and deepep_low_latency):

Accuracy: 0.950
Invalid: 0.000
Latency: 349.456 s
Output throughput: 417.849 token/s
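
For a quick sanity read of the numbers above, the relative throughput of the three backends can be computed from the reported figures (the figures themselves are taken verbatim from the runs above):

```python
# Reported GSM8k output throughput (token/s) from the runs above.
throughput = {
    "flashinfer_cutlass": 371.731,
    "flashinfer_trtllm": 366.607,
    "flashinfer_cutedsl": 417.849,
}

baseline = throughput["flashinfer_cutlass"]
speedup = {name: tps / baseline for name, tps in throughput.items()}
# cutedsl comes out roughly 12% faster than cutlass on this run, at a
# slightly lower accuracy (0.950 vs 0.960).
```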

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the blackwell SM100/SM120 label Mar 20, 2026
@wenscarl wenscarl changed the title [Nvidia] Modelopt Quantization Refactorization: [Nvidia] Modelopt Quantization Refactorization Mar 20, 2026
@wenscarl wenscarl marked this pull request as ready for review March 20, 2026 17:35

Collaborator

@Edwardf0t1 Edwardf0t1 left a comment


Thanks @wenscarl for getting this started - left a few comments.

In addition, can you check if modelopt quantization + sglang runtime workflow still works as described in this blog:
https://www.lmsys.org/blog/2025-12-02-modelopt-quantization/

  • Comment thread: python/sglang/srt/layers/moe/fused_moe_triton/layer.py
  • Comment thread: python/sglang/srt/layers/quantization/modelopt_quant.py
  • Comment thread: python/sglang/srt/layers/quantization/modelopt_quant.py
  • Comment thread: python/sglang/srt/layers/quantization/modelopt/schemes/__init__.py (outdated)
  • Comment thread: python/sglang/srt/layers/quantization/modelopt/modelopt.py
- Move implementation to layers/quantization/modelopt/impl.py
- Re-export public API from layers/quantization/modelopt/__init__.py
- Update imports in quantization registry, weight_utils, fused_moe, tests

No functional changes.

Made-with: Cursor

refactor init
@wenscarl wenscarl force-pushed the modelopt_refactor branch from 578ca2f to 5c9f997 Compare April 6, 2026 19:20
@github-actions github-actions Bot added the quant LLM Quantization label Apr 6, 2026
@wenscarl wenscarl requested a review from Edwardf0t1 April 6, 2026 19:20
@wenscarl wenscarl force-pushed the modelopt_refactor branch from 5c9f997 to 7c8cd9e Compare April 6, 2026 19:35
vroomfondel added a commit to vroomfondel/dgxarley that referenced this pull request Apr 7, 2026
**Upstream status** as of 2026-04-06:
- Qwen3.5: fixed via [PR #19767](sgl-project/sglang#19767) (merged 2026-03-09, included in v0.5.10)
- Qwen3: [PR #21461](sgl-project/sglang#21461) — closed without merge 2026-03-30 (CI failure), superseded by #21822
- Qwen3: [PR #21822](sgl-project/sglang#21822) — new fix opened 2026-03-26, addresses `AttributeError: 'LazyValue' object has no attribute 'keys'` in `eplb_manager.py` for Qwen3 MoE. Code review 2026-04-04 by `Fridge003` and `Evgueni-Petrov-aka-espetrov`. Alternative `LazyValue.__getattr__` approach proposed (avoids modifying the model class). **Approved** by `Fridge003` on 2026-04-06, CI rerun triggered — awaiting merge. (Duplicate [PR #21820](sgl-project/sglang#21820) was closed same day in favour of #21822.) Not in v0.5.10

When `--enable-eplb` is active with EP, the `EPLBManager` crashes after its first rebalance interval (default: 1000 forward passes):
- SGLang PR #17137 — non-Marlin WNA16MoE port (does not fix EP bug)
- SGLang #14158 — update_weights_from_tensor for WNA16MoE (unrelated)
- SGLang [PR #13715](sgl-project/sglang#13715) — fix EPLB + FP4 weight tensor filtering (merged, different issue)
- SGLang [PR #20963](sgl-project/sglang#20963) — Nvidia modelopt refactoring (1/N). Under active review: reviewer `Edwardf0t1` asked for end-to-end verification 2026-03-31, author `wenscarl` responded 2026-04-01 and posted 3 further inline review responses 2026-04-06. Not stalled but awaiting approval. Migrates the NVFP4 code as-is — expected vehicle for EP-awareness fixes (#20869, #21630). Watch this PR for resolution of the NVFP4 input_scale and CutlassMoEParams bugs
- SGLang [PR #21822](sgl-project/sglang#21822) — new EPLB/Qwen3 fix (opened 2026-03-26). Addresses `LazyValue.keys()` AttributeError. Code review 2026-04-04 by `Fridge003` and `Evgueni-Petrov-aka-espetrov`. Alternative `LazyValue.__getattr__` approach proposed. **Approved** by `Fridge003` on 2026-04-06, CI rerun triggered — awaiting merge
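
The `LazyValue.__getattr__` alternative proposed for #21822 can be sketched as follows. This is a minimal reconstruction from the error message quoted above, not the actual sglang implementation: delegating unknown attributes to the materialized value makes calls like `.keys()` work without modifying the model class.

```python
class LazyValue:
    """Minimal sketch: wraps a factory and materializes on first use."""

    def __init__(self, factory):
        self._factory = factory
        self._value = None
        self._materialized = False

    def _materialize(self):
        if not self._materialized:
            self._value = self._factory()
            self._materialized = True
        return self._value

    def __getattr__(self, name):
        # __getattr__ is only invoked when normal lookup fails, so the
        # internal attributes set in __init__ are never routed through here.
        return getattr(self._materialize(), name)
```

With this delegation, `eplb_manager.py` could call `lazy_weights.keys()` directly and the underlying dict would be materialized on demand.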

"Good code is like humor: when you have to explain it, it’s bad." - Cory House
P.S.: Code reviews and approvals are crucial for maintaining high-quality software.

Labels

blackwell SM100/SM120 quant LLM Quantization


2 participants