[Nvidia] 1/N Modelopt Quantization Refactorization #20963
Open
wenscarl wants to merge 2 commits into sgl-project:main from
Conversation
wenscarl force-pushed from 11f3aed to af728cc
wenscarl force-pushed from 7bfe535 to 578ca2f
Edwardf0t1 (Collaborator) reviewed Mar 31, 2026 and left a comment:
Thanks @wenscarl for getting this started - left a few comments.
In addition, can you check if modelopt quantization + sglang runtime workflow still works as described in this blog:
https://www.lmsys.org/blog/2025-12-02-modelopt-quantization/
- Move implementation to layers/quantization/modelopt/impl.py
- Re-export public API from layers/quantization/modelopt/__init__.py
- Update imports in quantization registry, weight_utils, fused_moe, tests

No functional changes. Made-with: Cursor refactor init
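The "re-export public API from `__init__.py`" step in the commit message above can be sketched with stand-in modules. Everything here (`modelopt_demo`, the `impl` submodule, the toy `fp4_quantize`) is illustrative, not sglang's actual code; it only shows why call sites keep working after the move.

```python
import sys
import types

# Stand-in for modelopt/impl.py: holds the actual implementation.
impl = types.ModuleType("modelopt_demo.impl")

def fp4_quantize(xs):
    # Toy stand-in: snap values to a 1/16 grid (NOT real FP4 quantization).
    return [round(v * 16) / 16 for v in xs]

impl.fp4_quantize = fp4_quantize

# Stand-in for modelopt/__init__.py: re-export the public API.
pkg = types.ModuleType("modelopt_demo")
pkg.impl = impl
pkg.fp4_quantize = impl.fp4_quantize

sys.modules["modelopt_demo"] = pkg
sys.modules["modelopt_demo.impl"] = impl

# Call sites can import from the package path; both paths resolve to the
# same object, so moving code into impl.py is invisible to users.
from modelopt_demo import fp4_quantize as via_pkg
from modelopt_demo.impl import fp4_quantize as via_impl
print(via_pkg is via_impl)  # True
```

The same identity check is a cheap regression test for refactors like this one: if a re-export is forgotten, the package-path import fails immediately.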
wenscarl force-pushed from 578ca2f to 5c9f997
wenscarl force-pushed from 5c9f997 to 7c8cd9e
vroomfondel added a commit to vroomfondel/dgxarley that referenced this pull request on Apr 7, 2026
**Upstream status** as of 2026-04-06:

- Qwen3.5: fixed via [PR #19767](sgl-project/sglang#19767) (merged 2026-03-09, included in v0.5.10)
- Qwen3: [PR #21461](sgl-project/sglang#21461) — closed without merge 2026-03-30 (CI failure), superseded by #21822
- Qwen3: [PR #21822](sgl-project/sglang#21822) — new fix opened 2026-03-26; addresses `AttributeError: 'LazyValue' object has no attribute 'keys'` in `eplb_manager.py` for Qwen3 MoE. Code review 2026-04-04 by `Fridge003` and `Evgueni-Petrov-aka-espetrov`; an alternative `LazyValue.__getattr__` approach was proposed (avoids modifying the model class). **Approved** by `Fridge003` on 2026-04-06, CI rerun triggered — awaiting merge. (Duplicate [PR #21820](sgl-project/sglang#21820) was closed the same day in favour of #21822.)

Not in v0.5.10. When `--enable-eplb` is active with EP, the `EPLBManager` crashes after its first rebalance interval (default: 1000 forward passes):

- SGLang PR #17137 — non-Marlin WNA16MoE port (does not fix the EP bug)
- SGLang #14158 — update_weights_from_tensor for WNA16MoE (unrelated)
- SGLang [PR #13715](sgl-project/sglang#13715) — fix EPLB + FP4 weight tensor filtering (merged, different issue)
- SGLang [PR #20963](sgl-project/sglang#20963) — Nvidia modelopt refactoring (1/N). Under active review: reviewer `Edwardf0t1` asked for end-to-end verification 2026-03-31; author `wenscarl` responded 2026-04-01 and posted 3 further inline review responses 2026-04-06. Not stalled, but awaiting approval. Migrates the NVFP4 code as-is — expected vehicle for EP-awareness fixes (#20869, #21630). Watch this PR for resolution of the NVFP4 input_scale and CutlassMoEParams bugs.
- SGLang [PR #21822](sgl-project/sglang#21822) — new EPLB/Qwen3 fix (opened 2026-03-26). Addresses the `LazyValue.keys()` AttributeError. Code review 2026-04-04 by `Fridge003` and `Evgueni-Petrov-aka-espetrov`; alternative `LazyValue.__getattr__` approach proposed. **Approved** by `Fridge003` on 2026-04-06, CI rerun triggered — awaiting merge.

"Good code is like humor: when you have to explain it, it's bad." - Cory House

P.S.: Code reviews and approvals are crucial for maintaining high-quality software.
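The `LazyValue.__getattr__` idea referenced in the commit message above can be sketched in a few lines: delegate unknown attribute lookups to the lazily materialized payload, so call sites such as `.keys()` work without touching the model class. The class below is illustrative only, not sglang's actual `LazyValue`.

```python
class LazyValue:
    """Illustrative lazy wrapper (not sglang's actual implementation)."""

    def __init__(self, factory):
        self._factory = factory
        self._materialized = None
        self._done = False

    @property
    def value(self):
        # Materialize on first access, then cache the result.
        if not self._done:
            self._materialized = self._factory()
            self._done = True
        return self._materialized

    def __getattr__(self, name):
        # Invoked only when normal attribute lookup fails: forward to the
        # payload, so lv.keys() behaves like dict.keys() on the real value.
        return getattr(self.value, name)

lv = LazyValue(lambda: {"a": 1, "b": 2})
print(sorted(lv.keys()))  # ['a', 'b']
```

The appeal of this variant is exactly what the review thread notes: the fix lives entirely in the wrapper, so `eplb_manager.py` and the model classes need no changes.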
Summary
Refactors ModelOpt quantization into a package layout aligned with the roadmap in #15194: a top-level config module, shared utils, and a `schemes/` subtree for recipe-specific configs and methods. This is a refactor only — intended behavior and the public surface of `sglang.srt.layers.quantization.modelopt` stay the same for normal use.

cc @Edwardf0t1 @Fridge003
Motivation

Mirror the package layout already used by compressed_tensors (`compressed_tensors.py` + `utils.py` + `schemes/`).

What changed
- `modelopt/modelopt.py` — `ModelOptQuantConfig` only; lazy import for `ModelOptFp8KVCacheMethod` on the KV path to avoid import cycles.
- `modelopt/utils.py` — shared utils (`FP4_GEMM_ALIGNMENT`, `pad_nvfp4_*`, `slice_nvfp4_output`, …).
- `modelopt/schemes/modelopt_fp8.py`
- `modelopt/schemes/modelopt_fp4.py` — (`fp4_quantize`, `fp4_gemm`, …), FP4 config, linear, MoE.
- `modelopt/schemes/modelopt_mixed_precision.py`
- `modelopt/schemes/__init__.py`
- `modelopt/__init__.py` — `fp4_gemm`, `fp4_quantize`, `enable_flashinfer_fp4_gemm` for CT NVFP4.
- `impl.py`

Related import fix
`compressed_tensors/schemes/compressed_tensors_w4a4_nvfp4.py` now imports FP4 helpers from `sglang.srt.layers.quantization.modelopt` instead of the old `modelopt_quant` module path.

Non-goals
… `sglang.srt.layers.quantization.modelopt`.

How to test
`import sglang.srt.layers.quantization.modelopt` and quantization registry entries for `modelopt_fp8` / `modelopt_fp4` / `modelopt_mixed`.

Risk / rollback
Low—file split and import wiring only. Rollback is a single revert; no weight-format or config-schema changes intended.
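The "lazy import … to avoid import cycles" note under "What changed" corresponds to the module-level `__getattr__` pattern from PEP 562. The sketch below demonstrates it on a dynamically built stand-in module; all names are illustrative, and `json.dumps` merely plays the role of the lazily imported symbol (the real code would import `ModelOptFp8KVCacheMethod` from the scheme module).

```python
import sys
import types

# Stand-in for modelopt/modelopt.py: a config module that defers a heavy
# import until the attribute is actually requested (PEP 562 pattern).
cfg = types.ModuleType("modelopt_cfg_demo")

def _lazy_getattr(name):
    if name == "KVCacheMethod":  # stand-in for ModelOptFp8KVCacheMethod
        import json  # stand-in for the heavy scheme module
        return json.dumps  # pretend this is the lazily imported class
    raise AttributeError(f"module 'modelopt_cfg_demo' has no attribute {name!r}")

# Placing __getattr__ in the module's namespace makes attribute access
# fall back to it whenever normal lookup on the module fails.
cfg.__getattr__ = _lazy_getattr
sys.modules[cfg.__name__] = cfg

# Importing the config module is cheap; the scheme module is only pulled
# in at this first attribute access, which is what breaks the cycle.
kv_method = cfg.KVCacheMethod
print(kv_method({"a": 1}))
```

In the actual package this keeps `modelopt/modelopt.py` importable even though the scheme modules import back into the package.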
Accuracy (nvfp4, GSM8k):

- flashinfer_cutlass:
- flashinfer_trtllm:
- flashinfer_cutedsl: `MOE_NVFP4_DISPATCH=1`, `DEEPEP_MAX_NUM_DISPATCH_TOKENS_PER_TOKEN: 512` + `deepep_low_latency`
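The utils mentioned under "What changed" (`FP4_GEMM_ALIGNMENT`, `pad_nvfp4_*`, `slice_nvfp4_output`) suggest a common pad-then-slice pattern around an alignment-constrained GEMM. The sketch below is illustrative only: the helper names are toy stand-ins and the alignment value of 16 is an assumption, not sglang's actual constant.

```python
FP4_GEMM_ALIGNMENT = 16  # assumed value, for illustration only

def pad_rows(rows, alignment=FP4_GEMM_ALIGNMENT):
    # Round the row count up to the next multiple of `alignment`
    # (ceil division via negation, then scale back up).
    return -(-rows // alignment) * alignment

def slice_output(out, real_rows):
    # Drop the padded rows from the GEMM output.
    return out[:real_rows]

padded = pad_rows(100)
print(padded)                       # 112
out = list(range(padded))           # pretend GEMM output, one entry per row
print(len(slice_output(out, 100)))  # 100
```

The point of centralizing these helpers in `modelopt/utils.py` is that every FP4 call site pads and slices the same way, so a change to the alignment touches one file.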