
[fix] support general logic to bypass fp32 downcast and fix qwen35 A_log dtype #975

Merged
guapisolo merged 4 commits into main from fix/qwen35
Apr 14, 2026

Conversation

@guapisolo
Collaborator

@guapisolo guapisolo commented Apr 13, 2026

PR description generated by cc and reviewed by me.
radixark/Megatron-LM#23 is also needed.

FP32 parameter preservation across Megatron's Float16Module wrap

Motivation

Qwen3.5's A_log must stay fp32 through the hf → mcore conversion. If it gets rounded to bf16, the Megatron-side values no longer match sglang's fp32 A_log on the rollout side — precision drifts and the train/rollout pair is no longer equivalent. The previous patch_weight_to_mcore_format_preserve_fp32 monkey-patch looked correct but silently shipped bf16.

The three cast points

| # | Location | What it casts |
|---|----------|---------------|
| 1 | megatron/core/transformer/module.py:440 — Float16Module ctor runs module.bfloat16() | Every nn.Parameter.data → bf16 at wrap time |
| 2 | mbridge/core/bridge.py:842 — w.to(self.dtype) in _weight_to_mcore_format | HF tensor → Bridge's self.dtype (bf16) |
| 3 | mbridge/core/bridge.py:246 — t.to(param.device, dtype=param.dtype) in load_weights | mcore tensor → Megatron param.dtype (bf16, due to cast 1) |

Cast 1 has no declarative opt-out in nn.Module or Megatron — even Megatron's own _maintain_float32_expert_bias (moe/router.py:209-218) uses a post-hoc .data.to(float32) workaround.
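To make cast 1 concrete, here is a minimal, self-contained sketch (plain PyTorch, not Megatron's actual classes) of the blanket cast and the post-hoc .data re-cast pattern that _maintain_float32_expert_bias relies on:

```python
import torch
import torch.nn as nn

# Stand-in for any module carrying an fp32-sensitive parameter.
m = nn.Linear(4, 4)
m.A_log = nn.Parameter(torch.randn(4, dtype=torch.float32))

# Cast 1: Float16Module's ctor effectively does this -- a blanket,
# recursive cast with no per-parameter opt-out.
m.bfloat16()
assert m.A_log.dtype == torch.bfloat16  # fp32 intent silently lost

# The post-hoc workaround: swap .data back after the blanket cast.
# The Parameter object itself is untouched, so anything registered
# against it later (optimizer, DDP buckets) keeps a stable reference.
m.A_log.data = m.A_log.data.to(torch.float32)
assert m.A_log.dtype == torch.float32
```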

Why the old patch was insufficient

patch_weight_to_mcore_format_preserve_fp32 set self.dtype = None during _weight_to_mcore_format, which handled cast 2 only. It had no handle on cast 1 (Megatron-side) and missed cast 3 entirely — cast 3 reads param.dtype, which is bf16 because cast 1 already ran. End-to-end, A_log still shipped as bf16.

The replacement — two pieces, each closing one cast point

Downstream (miles/backends/megatron_utils/fp32_param_utils.py): mark_param_dtype(param, dtype) tags intent at the model definition site; enforce_marked_param_dtypes(model) is called right after get_model (in both miles/backends/megatron_utils/model.py:129 and tools/convert_hf_to_torch_dist.py:113) and re-casts tagged params' .data back to fp32, preserving Parameter identity so the optimizer and DDP bucketing registered afterwards see stable tensors.

This closes cast 1 and cast 3 in one step — once param.dtype == fp32, the bridge.py:246 in-place cast is a no-op for tagged params.
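A minimal sketch of what the helper pair could look like (the real implementations live in fp32_param_utils.py; the tag attribute name below is an assumption for illustration):

```python
import torch
import torch.nn as nn

_TAG = "_miles_marked_dtype"  # hypothetical attribute name, sketch only

def mark_param_dtype(param: nn.Parameter, dtype: torch.dtype) -> None:
    # Record intent at the definition site. A plain Python attribute
    # survives Float16Module's wrap because the blanket cast swaps
    # .data on the same Parameter object rather than replacing it.
    setattr(param, _TAG, dtype)

def enforce_marked_param_dtypes(model: nn.Module) -> None:
    # Run right after get_model, before optimizer/DDP registration:
    # re-cast tagged params' .data in place, preserving identity.
    for param in model.parameters():
        wanted = getattr(param, _TAG, None)
        if wanted is not None and param.dtype != wanted:
            param.data = param.data.to(wanted)
```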

Upstream (miles_plugins/mbridge/qwen3_5.py:254-261): subclass override in Qwen3_5Bridge._weight_to_mcore_format early-returns hf_weights[0].to(torch.float32).contiguous() for any name ending in self_attention.linear_attn.A_log (matches MTP layers too). This closes cast 2.
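Sketched below under the assumption that _weight_to_mcore_format takes the mcore param name plus a list of HF tensors (the base-class body is a stand-in for mbridge's real cast-2 path, not copied from it):

```python
import torch

class _BaseBridge:
    """Stand-in for mbridge.core.bridge.Bridge (sketch only)."""
    dtype = torch.bfloat16

    def _weight_to_mcore_format(self, name, hf_weights):
        # Cast 2: the default path downcasts the HF tensor to self.dtype.
        return hf_weights[0].to(self.dtype).contiguous()

class Qwen3_5Bridge(_BaseBridge):
    def _weight_to_mcore_format(self, name, hf_weights):
        # Name-matched early return; the suffix test also catches the
        # MTP layers' copies of A_log.
        if name.endswith("self_attention.linear_attn.A_log"):
            return hf_weights[0].to(torch.float32).contiguous()
        return super()._weight_to_mcore_format(name, hf_weights)
```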

Why both pieces are required

From tools/debug_a_log_old_flow.py:

| Config | mcore_weight | After param.dtype cast | Final dtype | Value-accurate |
|--------|--------------|------------------------|-------------|----------------|
| Nothing | bf16 | bf16 | bf16 | ✗ |
| Old patch only | fp32 | bf16 | bf16 | ✗ |
| Enforce only (no mbridge override) | bf16 | fp32 up-cast | fp32 | ✗ — value rounded at cast 2 |
| Both (current) | fp32 | fp32 | fp32 | ✓ |

The enforce-only row is the trap: dtype is fp32 but the bits were already rounded at cast 2 and up-cast back into an fp32 container. Only a value-level comparison against the HF source catches it — see test_old_patch_only_regresses_without_enforce in the test file.
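The trap is easy to reproduce in isolation; an fp32 container says nothing about whether the bits survived an intermediate bf16 hop:

```python
import torch

hf = torch.tensor([0.123456789], dtype=torch.float32)   # HF source value
rounded = hf.to(torch.bfloat16).to(torch.float32)       # cast 2, then up-cast

print(rounded.dtype)             # torch.float32 -- a dtype check passes
print(torch.equal(hf, rounded))  # False -- bf16 rounding already happened
```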

Extending to other params

  1. At the model definition site: mark_param_dtype(self.X, torch.float32) right after self.X = nn.Parameter(...).
  2. If HF → mcore must also skip mbridge's self.dtype pre-cast (i.e. the HF ckpt ships it fp32 and you can't afford rounding at cast 2), add a name-matched early-return in the relevant Bridge subclass's _weight_to_mcore_format.

No changes to enforce_marked_param_dtypes, Megatron, or the mbridge base class are needed.
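For step 1, a hypothetical definition site (MyLayer and gate_log are invented names for illustration):

```python
import torch
import torch.nn as nn

from miles.backends.megatron_utils.fp32_param_utils import mark_param_dtype

class MyLayer(nn.Module):
    def __init__(self):
        super().__init__()
        # Hypothetical fp32-sensitive parameter in an otherwise bf16 model.
        self.gate_log = nn.Parameter(torch.zeros(8, dtype=torch.float32))
        mark_param_dtype(self.gate_log, torch.float32)  # step 1: tag intent
```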

Tests

tests/fast/backends/megatron_utils/test_fp32_param_utils.py — 13 CPU-only tests covering the downstream helper, the upstream bridge override, and a bit-exact end-to-end round-trip plus the negative regression guard against the old patch's failure mode.

@guapisolo guapisolo force-pushed the fix/qwen35 branch 2 times, most recently from b96a400 to 87930ef on April 14, 2026 at 19:17
Rewrite fp32_param_preservation.md as a practical how-to guide for
supporting fp32 parameters in bf16 models. Leads with a 2-step quick
start, uses Qwen3.5 A_log as a complete example, and keeps the cast
chain analysis as background. Register the doc in index.rst.

Made-with: Cursor
@guapisolo guapisolo merged commit 85fe651 into main Apr 14, 2026
17 checks passed
@guapisolo guapisolo deleted the fix/qwen35 branch April 14, 2026 20:08
GuanxingLu pushed a commit to GuanxingLu/miles that referenced this pull request Apr 21, 2026
…log dtype (radixark#975)

Co-authored-by: yueming-yuan <yym022502@gmail.com>
DavidBellamy added a commit to LLM360/miles that referenced this pull request Apr 21, 2026
…region clusters (#10)

* Revert "[BUGFIX] [P2PRDMA] Add rollout post-processing after P2PRDMA weight updates" (radixark#882)

* [Fix] fix ci (radixark#894)

* Avoid threading for ray getting object (radixark#886)

* Add explicit errors for unsupported Megatron profiles (radixark#887)

* Add nvfp4 quantizer files (radixark#907)

* Bump flash-linear-attention version to 0.4.2 (radixark#892)

* [BUGFIX] Invoke "post_process_quantization" by default after weight updating (radixark#890)

Co-authored-by: Yueming Yuan <yym022502@gmail.com>

* Add heartbeat and id to session server (radixark#866)

* fix: adding thin glm5 image to docker build + latest tag sync (radixark#871)

* Add consistent hashing routing policy for rollout (radixark#891)

Co-authored-by: Yueming Yuan <yueming@Mac.attlocal.net>

* [example] add retool v2 example with multi-turn framework interfaces (radixark#654)

Co-authored-by: GuanxingLu <gxlu02@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Expose rollout-batch-size, n-samples-per-prompt, global-batch-size as CLI args in swe-agent-v2 (radixark#954)

Co-authored-by: Shi Dong <shi.dong@radixark.ai>

* chore: remove obsolete swe-agent server.py and run-qwen3.sh (radixark#952)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add weight staleness control for fully async rollout (radixark#958)

* Fix/pause generation mode (radixark#924)

Co-authored-by: Yueming Yuan <yym022502@gmail.com>

* [v0.5.10][1] Bump sglang to v0.5.10 (radixark#898)

* [v0.5.10][2] Fix apply_chat_template behavior for transformers >=5.0 (radixark#926)

Co-authored-by: guapisolo <guapisolo@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [v0.5.10][3] Fix processor return_tensors duplicate kwarg for transformers >=5.0 (radixark#927)

Co-authored-by: guapisolo <guapisolo@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [v0.5.10][4] Fix _no_split_modules set not subscriptable in transformers >=5.0 (radixark#931)

* [v0.5.10][5] Disable piecewise cuda graph to avoid NVLS oom (radixark#935)

* [v0.5.10][6][FSDP] fix outdated weight update logic in FSDP (radixark#948)

Co-authored-by: guapisolo <guapisolo@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: maocheng23 <35615230+maocheng23@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [v0.5.10][7][FSDP] move FSDP to experimental and disable by default (radixark#961)

* Add skiplist and more robust calculation on val (radixark#965)

* [fix] tiny fix debug rollout only in weight version check (radixark#967)

* feat: real cp support with relayout fix for qwen3.5 train/rollout mismatch (radixark#885)

* [AMD] Upgrade to sglv0.5.10 (radixark#973)

* switch model to actor (radixark#756)

* [fix] support general logic to bypass fp32 downcast and fix qwen35 A_log dtype (radixark#975)

Co-authored-by: yueming-yuan <yym022502@gmail.com>

* fix: populate prefix_cache_info in OpenAI/session rollout path (radixark#960)

* Remove prepare_harbor_tasks.py; use harbor-private adapters (radixark#982)

* [fix] Skip flush_cache in in_place mode and add fully async example (radixark#974)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* GLM47 full cmd for async and sync reasoning (radixark#986)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: handle non-tool appended messages in TITO incremental tokenization (radixark#949)

Co-authored-by: Yanbin Jiang <jybsuper@gmail.com>

* [docker] Add sgl-model-gateway install and download .tar.gz assets (radixark#895)

* [ci] fix hf rate limit error by caching tokenizer loading (radixark#1014)

Co-authored-by: maocheng23 <35615230+maocheng23@users.noreply.github.com>

* Use load_generate_function in legacy sglang_rollout path (radixark#1016)

* Update CODEOWNERS to add new reviewers (radixark#1021)

* Support moe lora for gpt-oss (radixark#798)

Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com>

* [fix] restore expert_bias to fp32 before bridge weight export (radixark#811)

* [chore] drop legacy transformers upgrade pin for glm47-flash and qwen35 (radixark#1018)

* [fix] Enforce param dtype before wrap ddp (radixark#992)

Co-authored-by: Zhichen Zeng <zczeng@uw.edu>

* [upgrade] update Megatron-Bridge source and LoRA CI to megatron e2e tests and  (radixark#1023)

* [CI] Drop --use-miles-router from R3 tests and add r3 comparison test between sgl & miles router (radixark#1015)

* wandb: raise init_timeout, add retry wrapper, fix shared-mode init for cross-region clusters

In online + shared mode, both `init_wandb_primary` and `init_wandb_secondary`
make HTTPS round-trips to wandb cloud (login + run create/attach). On
high-latency cross-region clusters (e.g. Abu Dhabi MBZUAI ↔ wandb-cloud
US-West) with concurrent actor bursts, a single round-trip can exceed the
wandb SDK's 90s default `init_timeout` — tearing down the whole run
with a silent handshake abort. This was observed on RL360 job 1564420
and has forced `WANDB_MODE=offline` as a global default ever since (see
https://github.com/LLM360/RL360/issues/87).

The issue's original diagnosis assumed a local primary↔secondary socket
handshake race. That's not how shared mode works — per wandb's own
feature PR (wandb/wandb#6882), each writer spawns
an independent wandb-core that talks to the cloud directly; aggregation
is server-side by run_id. No local socket exists. The failure mode is
pure network/latency, not a local readiness race.

Changes
-------

- Bump `init_timeout` to 300s for primary and secondary Settings.
  Configurable via `WANDB_INIT_TIMEOUT_SECS` env var for tuning.
- Wrap both init paths in a bounded exponential-backoff retry
  (`_wandb_init_with_retry`; sketched after this list) that re-attempts
  on wandb.errors.CommError and wandb.errors.UsageError. 3 attempts with
  5→10→20s backoff by default, tunable via `WANDB_INIT_RETRY_ATTEMPTS` /
  `WANDB_INIT_RETRY_BACKOFF_SECS`.
- Add `x_label` tagging per wandb distributed-training docs: primary
  gets `rank_<rank>_primary`, secondaries get `rank_<rank>_secondary`.
  Enables per-rank console-log filtering in the wandb UI.
- Drop `reinit=True` from secondary init_kwargs. Shared mode natively
  supports concurrent writers on a single run; `reinit=True` triggered
  stale-state warnings on secondary actors without functional benefit.
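
A minimal sketch of the retry wrapper, assuming only the env-var names
and defaults stated above:

```python
import os
import time

import wandb

def _wandb_init_with_retry(**init_kwargs):
    # Bounded exponential-backoff retry around wandb.init; re-attempts
    # only on the two transient wandb error classes named above.
    attempts = int(os.environ.get("WANDB_INIT_RETRY_ATTEMPTS", "3"))
    backoff = float(os.environ.get("WANDB_INIT_RETRY_BACKOFF_SECS", "5"))
    for attempt in range(1, attempts + 1):
        try:
            return wandb.init(**init_kwargs)
        except (wandb.errors.CommError, wandb.errors.UsageError):
            if attempt == attempts:
                raise
            time.sleep(backoff)
            backoff *= 2  # 5 -> 10 -> 20s with the defaults
```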

Followups this change enables
-----------------------------

- `WANDB_MODE=offline` can be removed from scale.yaml's extra_env
  default once a pilot run confirms online mode boots cleanly.
- The tmux-based `~/bin/wandb-sync-rl360.sh` workaround on David's M2
  account becomes obsolete (no more offline-only default).
- Near-realtime wandb dashboards replace the ~2-minute-lag offline
  sync; per-rank system metrics via x_label filtering.

---------

Co-authored-by: JD <jaedon.guo@gmail.com>
Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Ziang Li <ziangli@umich.edu>
Co-authored-by: Zhichen Zeng <zczeng@uw.edu>
Co-authored-by: JensenFire <xinji1@microsoft.com>
Co-authored-by: Yueming Yuan <yym022502@gmail.com>
Co-authored-by: maocheng23 <35615230+maocheng23@users.noreply.github.com>
Co-authored-by: Douglas Yang <douglasyang88@gmail.com>
Co-authored-by: Yueming Yuan <yueming@Mac.attlocal.net>
Co-authored-by: Huapeng Zhou <73010314+PopSoda2002@users.noreply.github.com>
Co-authored-by: GuanxingLu <gxlu02@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Shi-Dong <Shi-Dong@users.noreply.github.com>
Co-authored-by: Shi Dong <shi.dong@radixark.ai>
Co-authored-by: Jiajun Li <48857426+guapisolo@users.noreply.github.com>
Co-authored-by: guapisolo <guapisolo@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Yuzhen Zhou <82826991+zyzshishui@users.noreply.github.com>
Co-authored-by: Yanbin Jiang <jybsuper@gmail.com>
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: Yisheng Gong <yishenggong9437@gmail.com>