Add nvfp4 quantizer files #907
Conversation
Code Review
This pull request introduces NVFP4 (E2M1) quantization support for MoE expert GEMMs, featuring 1D block scaling and a conversion tool for Hugging Face checkpoints. The review identifies high-severity issues regarding potential division by zero in the quantization logic when processing zero-valued weight blocks, which could lead to NaN values. Other feedback points to potential memory exhaustion (OOM) during global scale collection on GPUs, significant code duplication between the processor and the conversion tool, and the need for better resource management using context managers for file I/O.
encode_scale = torch.div(1.0, decode_scale.to(torch.float32) * global_decode_scale)
scaled = weight_blocks * encode_scale
Potential division by zero when a block of weights contains only zeros. If vec_max is 0, decode_scale becomes 0, leading to encode_scale being inf and scaled being NaN (since 0 * inf = NaN). This can be fixed by handling the zero case for decode_scale using torch.where.
Suggested change:
- encode_scale = torch.div(1.0, decode_scale.to(torch.float32) * global_decode_scale)
- scaled = weight_blocks * encode_scale
+ decode_scale_f32 = decode_scale.to(torch.float32)
+ encode_scale = torch.where(decode_scale_f32 > 0, 1.0 / (decode_scale_f32 * global_decode_scale), torch.zeros_like(decode_scale_f32))
+ scaled = weight_blocks * encode_scale
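For reference, a minimal standalone sketch of the failure mode and the guard (the tensor shapes and the 0.5 global scale here are arbitrary illustration values, not taken from the PR):

import torch

def encode_blocks(weight_blocks: torch.Tensor, decode_scale: torch.Tensor, global_decode_scale: float) -> torch.Tensor:
    # Guard all-zero blocks: decode_scale == 0 would make encode_scale inf, and 0 * inf == NaN.
    decode_scale_f32 = decode_scale.to(torch.float32)
    encode_scale = torch.where(
        decode_scale_f32 > 0,
        1.0 / (decode_scale_f32 * global_decode_scale),
        torch.zeros_like(decode_scale_f32),
    )
    return weight_blocks * encode_scale

blocks = torch.zeros(1, 16)                              # one all-zero block
decode_scale = blocks.abs().amax(dim=-1, keepdim=True)   # 0 for the zero block
print(blocks * (1.0 / (decode_scale * 0.5)))             # unguarded: all NaN
print(encode_blocks(blocks, decode_scale, 0.5))          # guarded: all 0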
config_path = os.path.join(model_dir, "config.json")
if not os.path.exists(config_path):
    raise ValueError("config.json is required to use --keep-first-n or --keep-last-n.")
cfg = json.load(open(config_path))
Use a with statement when opening files to ensure they are properly closed, even if an exception occurs. This applies to other occurrences in this file as well (e.g., lines 463, 465, 473).
Suggested change:
- cfg = json.load(open(config_path))
+ with open(config_path) as f:
+     cfg = json.load(f)
References
- To prevent resource leaks (e.g., counters that are not decremented), use constructs like try...finally or a with statement to ensure cleanup logic is always executed, even in the case of exceptions or early returns.
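Since the same pattern recurs at several call sites, one option (the helper name is hypothetical, not in the PR) is a tiny loader that always closes the handle:

import json

def _load_json(path: str) -> dict:
    # The context manager closes the file even if json.load raises on malformed JSON.
    with open(path) as f:
        return json.load(f)

cfg = _load_json(config_path)  # config_path as defined in the surrounding code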
return True


def cast_to_fp4x2(x: torch.Tensor) -> torch.Tensor:
The quantization logic (cast_to_fp4x2, _quantize_nvfp4_1d, quantize_nvfp4) is duplicated between this file and miles/backends/megatron_utils/megatron_to_hf/processors/quantizer_nvfp4.py. This increases maintenance overhead and the risk of bugs. Consider moving these functions to a shared utility module.
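A possible shape for that refactor (the module name and signatures below are illustrative, not taken from the PR) is to keep a single copy of the helpers in a shared module and import it from both call sites:

# nvfp4_common.py -- hypothetical shared module holding the one copy of the quantization helpers
import torch

def cast_to_fp4x2(x: torch.Tensor) -> torch.Tensor:
    """Pack pairs of FP4 (E2M1) values into one uint8 each."""
    ...

def quantize_nvfp4(weight: torch.Tensor, block_size: int = 16):
    """NVFP4 quantization with 1D block scaling; returns packed weights plus scales."""
    ...

# quantizer_nvfp4.py and the HF conversion tool would then both do:
# from nvfp4_common import cast_to_fp4x2, quantize_nvfp4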
for filename in safetensors_files:
    with safetensors.safe_open(os.path.join(input_path, filename), framework="pt", device=device) as f:
        for key in f.keys():
            tensor = f.get_tensor(key)
In _collect_shared_global_amax, tensors are loaded onto the specified device (which could be GPU) without explicit memory management. For large models, this can lead to Out-Of-Memory (OOM) errors as the PyTorch caching allocator might not release memory quickly enough. Consider adding del tensor and torch.cuda.empty_cache() inside the loop if the device is CUDA, or performing this calculation on CPU.
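A hedged sketch of the CPU-side variant (the function body and the amax reduction are assumptions about what _collect_shared_global_amax computes, not copied from the PR):

import os
import safetensors
import torch

def collect_global_amax(input_path: str, safetensors_files: list) -> dict:
    amax = {}
    for filename in safetensors_files:
        # Load shards on CPU so the full checkpoint never has to fit in GPU memory.
        with safetensors.safe_open(os.path.join(input_path, filename), framework="pt", device="cpu") as f:
            for key in f.keys():
                tensor = f.get_tensor(key)
                cur = tensor.abs().amax().float()
                amax[key] = torch.maximum(amax[key], cur) if key in amax else cur
                del tensor  # drop the reference before loading the next tensor
    return amax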
@HumansAnd
Upload nvfp4 quantizer files from #546 for future NVFP4 integration.
Note: expert w1 and w3 share the same fp32 scales to match the SGLang requirement: https://github.com/sgl-project/sglang/blob/c4240218cbf3656862f11748a4530da5dbc30f86/python/sglang/srt/layers/quantization/modelopt_quant.py#L1698-L1705
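Illustrative only (the real grouping lives in the uploaded files; the names and scale convention below are assumptions): sharing one fp32 global scale between w1 and w3 amounts to reducing their amax jointly before deriving the scale, e.g. with the common E4M3 x E2M1 convention of amax / (448 * 6):

import torch

def shared_global_scale(w1: torch.Tensor, w3: torch.Tensor) -> torch.Tensor:
    # One amax over both tensors, so w1 and w3 of an expert get the identical fp32 global scale.
    amax = torch.maximum(w1.abs().amax(), w3.abs().amax()).float()
    return amax / (448.0 * 6.0)  # 448 = FP8 E4M3 max, 6 = FP4 E2M1 max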