
[fix] Add detach() to fp32 param shard for leaf-tensor consistency #23

Merged
guapisolo merged 1 commit into miles-main from fix/fp32 on Apr 14, 2026

Conversation


guapisolo commented on Apr 14, 2026

Context

The float16/bf16 branch (line 372) already uses detach().

shard_model_param = model_param.detach().view(-1)[
    param_range.start : param_range.end
]

This change makes the fp32 branch consistent. The fp32 branch becomes relevant when parameters are intentionally kept in fp32 (e.g., Qwen3.5's A_log via enforce_marked_param_dtypes).
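
For reference, a sketch of how the fp32 branch looks with this change applied, assuming it simply mirrors the bf16 branch above (the exact surrounding code lives in the distributed optimizer's param-shard setup):

# fp32 params.
elif model_param.type() == 'torch.cuda.FloatTensor':
    # Keep shard tensors as leaf tensors for torch Optimizer.
    shard_model_param = model_param.detach().view(-1)[
        param_range.start : param_range.end
    ]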

Why it won't break anything

detach() only disconnects the shard from the autograd graph (making it a leaf tensor). It does not copy data — the shard still shares the same underlying storage as model_param. Every downstream consumer of shard_fp32_groups uses manual data operations, never autograd:
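
As a quick standalone illustration of both properties (plain PyTorch, not the Megatron code; the tensor names are made up):

import torch

model_param = torch.randn(8, requires_grad=True)    # stands in for a model parameter
shard = model_param.detach().view(-1)[2:6]          # same pattern as the bf16 branch

assert shard.is_leaf                                # no grad_fn, valid as an optimizer param
shard.fill_(0.0)                                    # in-place write through the shard...
assert model_param.detach()[2:6].abs().sum() == 0   # ...lands in model_param's storage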

Gradient flow (_copy_model_grads_to_main_grads): Reads from model_param.main_grad, slices it, and assigns to shard.grad via direct attribute assignment — autograd is not involved.

Parameter writeback (copy_main_params_to_model_params): Re-slices from the param buffer and calls shard_model_param.data.copy_() — autograd is not involved.

Checkpoint load (copy_model_params_to_main_params): Re-slices from model_param.view(-1)[...] and calls shard_main_param.data.copy_() — autograd is not involved.

zero_grad: Clears .grad attribute or sets it to None — autograd is not involved.

Optimizer step: PyTorch optimizer operates on shard.data and shard.grad directly — in fact, it prefers leaf tensors and may warn or error on non-leaf params.

In short: Megatron's DistributedOptimizer completely bypasses PyTorch autograd for all gradient and parameter movement. detach() is a no-op in terms of data and behavior — it only satisfies PyTorch's expectation that optimizer params are leaf tensors.
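
A simplified, self-contained sketch of that pattern (hypothetical tensors and values; the steps only loosely follow the DistributedOptimizer methods listed above):

import torch

# Hypothetical fp32 setup: a model param plus a Megatron-style main_grad buffer attribute.
model_param = torch.randn(8)
model_param.main_grad = torch.randn(8)
start, end = 2, 6

shard_model_param = model_param.detach().view(-1)[start:end]  # leaf view into the model buffer
shard_main_param = shard_model_param.clone()                   # optimizer-owned main copy

# Grad flow: slice main_grad and assign it via the .grad attribute (no autograd involved).
shard_main_param.grad = model_param.main_grad.view(-1)[start:end].clone()

# Optimizer step: operates purely on .data/.grad of leaf tensors.
torch.optim.SGD([shard_main_param], lr=0.1).step()

# Parameter writeback: raw .data copy back into the model buffer (no autograd involved).
shard_model_param.data.copy_(shard_main_param)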

# fp32 params.
elif model_param.type() == 'torch.cuda.FloatTensor':
    shard_model_param = model_param.view(-1)[param_range.start : param_range.end]
    # Keep shard tensors as leaf tensors for torch Optimizer.

Why do this?

guapisolo (Author) replied on Apr 14, 2026


> Why do this?

Attached the reasoning to the PR description. BF16 already has this fix, but fp32 did not.

guapisolo changed the title from "fix small issue when pytorch has bug as optimizer param" to "[fix] fp32 gradient flow warning" on Apr 14, 2026
guapisolo changed the title from "[fix] fp32 gradient flow warning" to "[fix] Add detach() to fp32 param shard for leaf-tensor consistency" on Apr 14, 2026
guapisolo merged commit 32dbe9f into miles-main on Apr 14, 2026
3 checks passed