M-FSDP: Make fine_grained_param_gather configurable for MXFP8 to enable performance–memory trade-offs#4181
Conversation
|
This PR has been automatically converted to draft because all PRs must start as drafts. When you are ready for review, click Ready for Review to begin the review process. This will:
See the contribution guide for more details. |
| self.calculate_per_token_loss = calculate_per_token_loss | ||
| self.init_model_with_meta_device = init_model_with_meta_device | ||
| self.enable_fine_grained_param_gather_hook = enable_fine_grained_param_gather_hook | ||
| self.enable_fine_grained_param_gather_hook = ( |
There was a problem hiding this comment.
@shjwudp Can we override this to be True when we have fp8_recipe as MX-FP8 and fp8_param_gather is True? Users might forget to set this for MX-FP8 and might face issues, so better to override right?
There was a problem hiding this comment.
Isn't that what they were already doing? See megatron/core/distributed/fsdp/mcore_fsdp_adapter.py.
There was a problem hiding this comment.
Yes, but I think @shjwudp interpreted your feedback as "let's just make it a manual toggle".
What you want is OR logic:
enable_fine_grained_param_gather_hook=config.megatron_fsdp_enable_fine_grained_param_gather or (
config.fp8_recipe == "mxfp8" and ddp_config.fp8_param_gather
),
Also nit: can we make this argument and config name shorter: megatron_fsdp_enable_fine_grained_param_gather -> megatron_fsdp_fine_grained_param_ag
There was a problem hiding this comment.
In that case, when we run any [mxfp8+native fp8 model param] config without recompute, we'll always run with fine-grained AG. I thought that's what we don't want.
I thought the idea is that we don't want fine-grained AG, especially when we don't have recompute.
There was a problem hiding this comment.
I think Ritesh is correct, lets not override it, but we should set it outside in M-LM training loop based on fp8_param, fp8_recipe and recompute modules present or not.
There was a problem hiding this comment.
Oh shoot, that is a typo, I forgot to add the recompute part: args.recompute_granularity == 'selective'.
Then to summarize, if we are using MXFP8 and are using activation recompute, then fine-grained AG must be used. Otherwise, you can either toggle it, or just not set it at all.
# Optional: OR set it as an argument.
enable_fine_grained_param_gather_hook=(
config.fp8_recipe == "mxfp8" and ddp_config.fp8_param_gather and args.recompute_granularity == 'selective'
)
There was a problem hiding this comment.
This looks good to me. I'm not sure if there's a way to add what @cspades mentioned within MCore itself. But if there is, then it makes sense to have that as a heuristic.
cspades
left a comment
There was a problem hiding this comment.
Definitely could use more documentation as well, it's not obvious how fine-grained / non-recursive module AG is related to row-wise / col-wise parameters.
The Megatron-FSDP logic that figures out what parameters to gather, as well as the TransformerEngine logic for when row-wise or col-wise data exists, should be explained in the future.
|
LGTM. |
|
/ok to test 1cd7838 |
|
/ok to test d91371d |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26204426356 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26204532704 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26204542460 |
…le performance–memory trade-offs (NVIDIA#4181)
…le performance–memory trade-offs (NVIDIA#4181)
What does this PR do ?
This change updates the MXFP8 implementation so that
fine_grained_param_gather, previously always enabled, is now exposed as a configurable option.Contribution process
Pre-checks
Code review
Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS.Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change
megatron/core, once all expert reviewers have approved, theFinal Reviewlabel is applied automatically and final reviewers are assigned.For PRs outside
megatron/core, this step is skipped.Step 3: Approved
Once all required reviewers have approved, the
Approvedlabel is applied automatically.Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.MRs are mergable after one approval by either
eharper@nvidia.comorzijiey@nvidia.com.