
[Nvidia] 1/N Modelopt Quantization Refactorization #20963

Open
wenscarl wants to merge 2 commits into sgl-project:main from wenscarl:modelopt_refactor

Conversation

@wenscarl
Collaborator

@wenscarl wenscarl commented Mar 20, 2026

Summary

Refactors ModelOpt quantization into a package layout aligned with the roadmap in #15194: a top-level config module, shared utils, and a schemes/ subtree for recipe-specific configs and methods.

This is a refactor only: intended behavior and the public surface of sglang.srt.layers.quantization.modelopt stay the same for normal use.
cc @Edwardf0t1, @Fridge003
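
For orientation, a sketch of the resulting package layout, reconstructed from the change list in this description (the exact file tree is an inference, not an authoritative listing):

```
python/sglang/srt/layers/quantization/modelopt/
├── __init__.py                     # package public API
├── modelopt.py                     # ModelOptQuantConfig
├── utils.py                        # NVFP4 padding / alignment helpers
└── schemes/
    ├── __init__.py                 # re-exports scheme symbols
    ├── modelopt_fp8.py             # FP8 config, linear, KV cache, MoE
    ├── modelopt_fp4.py             # FP4 bootstrap, config, linear, MoE
    └── modelopt_mixed_precision.py # mixed-precision config
```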

Motivation

  • Easier navigation and review (FP8 / FP4 / mixed precision live in dedicated scheme modules).
  • Structural parity with compressed_tensors (compressed_tensors.py + utils.py + schemes/).

What changed

| Area | Change |
| --- | --- |
| `modelopt/modelopt.py` | `ModelOptQuantConfig` only; lazy import for `ModelOptFp8KVCacheMethod` on the KV path to avoid import cycles. |
| `modelopt/utils.py` | NVFP4 padding / alignment helpers (`FP4_GEMM_ALIGNMENT`, `pad_nvfp4_*`, `slice_nvfp4_output`, …). |
| `modelopt/schemes/modelopt_fp8.py` | FP8 config, linear, KV cache, MoE. |
| `modelopt/schemes/modelopt_fp4.py` | FP4 bootstrap (`fp4_quantize`, `fp4_gemm`, …), FP4 config, linear, MoE. |
| `modelopt/schemes/modelopt_mixed_precision.py` | Mixed-precision config. |
| `modelopt/schemes/__init__.py` | Re-exports scheme symbols. |
| `modelopt/__init__.py` | Package public API; re-exports `fp4_gemm`, `fp4_quantize`, `enable_flashinfer_fp4_gemm` for CT NVFP4. |
| Monolithic `impl.py` | Removed (content moved, not rewritten). |
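
As an illustration of what the `utils.py` helpers do, here is a minimal sketch of alignment padding. The helper name, signature, and the alignment value 16 are assumptions for illustration only, not the actual sglang API:

```python
import math

# Assumed value for illustration; the real FP4_GEMM_ALIGNMENT lives in
# modelopt/utils.py and may differ.
FP4_GEMM_ALIGNMENT = 16

def pad_to_fp4_alignment(num_rows: int, alignment: int = FP4_GEMM_ALIGNMENT) -> int:
    """Round a row count up to the next multiple of the GEMM alignment."""
    return math.ceil(num_rows / alignment) * alignment
```

For example, a 100-row tensor would be padded to 112 rows before the FP4 GEMM, and `slice_nvfp4_output`-style helpers would trim the padding back off afterwards.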

Related import fix

  • compressed_tensors/schemes/compressed_tensors_w4a4_nvfp4.py now imports FP4 helpers from sglang.srt.layers.quantization.modelopt instead of the old modelopt_quant module path.
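
The lazy-import pattern used on the KV path can be sketched generically. In this toy version, the stdlib `json` module stands in for the scheme module, and the class and method names are illustrative, not the real sglang symbols:

```python
class QuantConfigSketch:
    """Toy stand-in for a config class that breaks an import cycle."""

    def get_kv_cache_method(self):
        # Deferred import: resolving the dependency at call time rather than
        # at module load time lets the scheme module import this module
        # without creating a circular import.
        from json import dumps  # stand-in for ModelOptFp8KVCacheMethod
        return dumps
```

By the time the method is called, all modules have finished importing, so the cycle never bites; at module load time, neither side needs the other.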

Non-goals

  • No new quantization recipes, kernels, or checkpoint formats.
  • No intentional breaking changes for code that already imports from sglang.srt.layers.quantization.modelopt.

How to test

  • Pre-commit / Ruff on touched paths.
  • Smoke: import sglang.srt.layers.quantization.modelopt and quantization registry entries for modelopt_fp8 / modelopt_fp4 / modelopt_mixed.
  • Optional: existing ModelOpt / NVFP4 / FP8 tests if available in CI.

Risk / rollback

Low: file split and import wiring only. Rollback is a single revert; no weight-format or config-schema changes intended.

Accuracy:

nvfp4:

```shell
python3 -m sglang.launch_server \
  --model-path nvidia/DeepSeek-R1-0528-FP4-v2 \
  --trust-remote-code \
  --disable-radix-cache \
  --kv-cache-dtype fp8_e4m3 \
  --max-running-requests 256 \
  --disable-cuda-graph \
  --chunked-prefill-size 2048 \
  --mem-fraction-static 0.89 \
  --max-prefill-tokens 16384 \
  --tp 4 \
  --dp 4 \
  --ep 4 \
  --enable-dp-attention \
  --quantization modelopt_fp4 \
  --moe-runner-backend flashinfer_cutlass  # or flashinfer_trtllm
```

GSM8k:

```shell
python3 benchmark/gsm8k/bench_sglang.py \
  --num-shots 8 \
  --num-questions 1316 \
  --parallel 1316
```

flashinfer_cutlass:

Accuracy: 0.960
Invalid: 0.000
Latency: 381.147 s
Output throughput: 371.731 token/s

flashinfer_trtllm:

Accuracy: 0.951
Invalid: 0.000
Latency: 403.069 s
Output throughput: 366.607 token/s

flashinfer_cutedsl (with `MOE_NVFP4_DISPATCH=1`, `DEEPEP_MAX_NUM_DISPATCH_TOKENS_PER_TOKEN=512`, and deepep_low_latency):

Accuracy: 0.950
Invalid: 0.000
Latency: 349.456 s
Output throughput: 417.849 token/s
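
For a quick sanity read of the numbers above, the relative throughput of the three backends can be computed from the reported figures (the figures themselves are taken verbatim from the runs above):

```python
# Reported GSM8k output throughput (token/s) from the runs above.
throughput = {
    "flashinfer_cutlass": 371.731,
    "flashinfer_trtllm": 366.607,
    "flashinfer_cutedsl": 417.849,
}

baseline = throughput["flashinfer_cutlass"]
speedup = {name: tps / baseline for name, tps in throughput.items()}
# cutedsl comes out roughly 12% faster than cutlass on this run, at a
# slightly lower accuracy (0.950 vs 0.960).
```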

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the blackwell SM100/SM120 label Mar 20, 2026
@wenscarl wenscarl changed the title [Nvidia] Modelopt Quantization Refactorization: [Nvidia] Modelopt Quantization Refactorization Mar 20, 2026
@wenscarl wenscarl marked this pull request as ready for review March 20, 2026 17:35

Collaborator

@Edwardf0t1 Edwardf0t1 left a comment


Thanks @wenscarl for getting this started - left a few comments.

In addition, can you check if modelopt quantization + sglang runtime workflow still works as described in this blog:
https://www.lmsys.org/blog/2025-12-02-modelopt-quantization/

  • Comment thread: python/sglang/srt/layers/moe/fused_moe_triton/layer.py
  • Comment thread: python/sglang/srt/layers/quantization/modelopt_quant.py
  • Comment thread: python/sglang/srt/layers/quantization/modelopt_quant.py
  • Comment thread: python/sglang/srt/layers/quantization/modelopt/schemes/__init__.py (outdated)
  • Comment thread: python/sglang/srt/layers/quantization/modelopt/modelopt.py
- Move implementation to layers/quantization/modelopt/impl.py
- Re-export public API from layers/quantization/modelopt/__init__.py
- Update imports in quantization registry, weight_utils, fused_moe, tests

No functional changes.

Made-with: Cursor

refactor init
@wenscarl wenscarl force-pushed the modelopt_refactor branch from 578ca2f to 5c9f997 Compare April 6, 2026 19:20
@github-actions github-actions Bot added the quant LLM Quantization label Apr 6, 2026
@wenscarl wenscarl requested a review from Edwardf0t1 April 6, 2026 19:20
@wenscarl wenscarl force-pushed the modelopt_refactor branch from 5c9f997 to 7c8cd9e Compare April 6, 2026 19:35
vroomfondel added a commit to vroomfondel/dgxarley that referenced this pull request Apr 7, 2026
**Upstream status** as of 2026-04-06:
- Qwen3.5: fixed via [PR #19767](sgl-project/sglang#19767) (merged 2026-03-09, included in v0.5.10)
- Qwen3: [PR #21461](sgl-project/sglang#21461) — closed without merge 2026-03-30 (CI failure), superseded by #21822
- Qwen3: [PR #21822](sgl-project/sglang#21822) — new fix opened 2026-03-26, addresses `AttributeError: 'LazyValue' object has no attribute 'keys'` in `eplb_manager.py` for Qwen3 MoE. Code review 2026-04-04 by `Fridge003` and `Evgueni-Petrov-aka-espetrov`. Alternative `LazyValue.__getattr__` approach proposed (avoids modifying the model class). **Approved** by `Fridge003` on 2026-04-06, CI rerun triggered — awaiting merge. (Duplicate [PR #21820](sgl-project/sglang#21820) was closed same day in favour of #21822.) Not in v0.5.10

When `--enable-eplb` is active with EP, the `EPLBManager` crashes after its first rebalance interval (default: 1000 forward passes):
- SGLang PR #17137 — non-Marlin WNA16MoE port (does not fix EP bug)
- SGLang #14158 — update_weights_from_tensor for WNA16MoE (unrelated)
- SGLang [PR #13715](sgl-project/sglang#13715) — fix EPLB + FP4 weight tensor filtering (merged, different issue)
- SGLang [PR #20963](sgl-project/sglang#20963) — Nvidia modelopt refactoring (1/N). Under active review: reviewer `Edwardf0t1` asked for end-to-end verification 2026-03-31, author `wenscarl` responded 2026-04-01 and posted 3 further inline review responses 2026-04-06. Not stalled but awaiting approval. Migrates the NVFP4 code as-is — expected vehicle for EP-awareness fixes (#20869, #21630). Watch this PR for resolution of the NVFP4 input_scale and CutlassMoEParams bugs
- SGLang [PR #21822](sgl-project/sglang#21822) — new EPLB/Qwen3 fix (opened 2026-03-26). Addresses `LazyValue.keys()` AttributeError. Code review 2026-04-04 by `Fridge003` and `Evgueni-Petrov-aka-espetrov`. Alternative `LazyValue.__getattr__` approach proposed. **Approved** by `Fridge003` on 2026-04-06, CI rerun triggered — awaiting merge
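
The `LazyValue.__getattr__` alternative proposed for #21822 can be sketched as follows. This is a minimal reconstruction from the error message quoted above, not the actual sglang implementation: delegating unknown attributes to the materialized value makes calls like `.keys()` work without modifying the model class.

```python
class LazyValue:
    """Minimal sketch: wraps a factory and materializes on first use."""

    def __init__(self, factory):
        self._factory = factory
        self._value = None
        self._materialized = False

    def _materialize(self):
        if not self._materialized:
            self._value = self._factory()
            self._materialized = True
        return self._value

    def __getattr__(self, name):
        # __getattr__ is only invoked when normal lookup fails, so the
        # internal attributes set in __init__ are never routed through here.
        return getattr(self._materialize(), name)
```

With this delegation, `eplb_manager.py` could call `lazy_weights.keys()` directly and the underlying dict would be materialized on demand.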

"Good code is like humor: when you have to explain it, it’s bad." - Cory House
P.S.: Code reviews and approvals are crucial for maintaining high-quality software.

Labels

blackwell SM100/SM120 quant LLM Quantization


2 participants