Skip to content

M-FSDP: Make fine_grained_param_gather configurable for MXFP8 to enable performance–memory trade-offs#4181

Merged
shjwudp merged 2 commits into
NVIDIA:mainfrom
shjwudp:mfsdp_opt_finegrained_ag
May 21, 2026
Merged

M-FSDP: Make fine_grained_param_gather configurable for MXFP8 to enable performance–memory trade-offs#4181
shjwudp merged 2 commits into
NVIDIA:mainfrom
shjwudp:mfsdp_opt_finegrained_ag

Conversation

@shjwudp

@shjwudp shjwudp commented Apr 7, 2026

Copy link
Copy Markdown
Contributor

What does this PR do ?

This change updates the MXFP8 implementation so that fine_grained_param_gather, previously always enabled, is now exposed as a configurable option.

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into `dev` branch The proposed review process for `dev` branch is under active discussion.

MRs are mergable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

@shjwudp shjwudp requested review from a team as code owners April 7, 2026 15:48
@copy-pr-bot

copy-pr-bot Bot commented Apr 7, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft April 7, 2026 15:48
@github-actions

github-actions Bot commented Apr 7, 2026

Copy link
Copy Markdown
Contributor

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

@shjwudp shjwudp marked this pull request as ready for review April 7, 2026 15:49
@shjwudp shjwudp changed the title Make fine_grained_param_gather configurable for MXFP8 to enable performance–memory trade-offs M-FSDP: Make fine_grained_param_gather configurable for MXFP8 to enable performance–memory trade-offs Apr 7, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team April 7, 2026 15:49
@shjwudp shjwudp requested review from cspades and sanandaraj5597 and removed request for a team April 7, 2026 15:49
@shjwudp shjwudp self-assigned this Apr 7, 2026
self.calculate_per_token_loss = calculate_per_token_loss
self.init_model_with_meta_device = init_model_with_meta_device
self.enable_fine_grained_param_gather_hook = enable_fine_grained_param_gather_hook
self.enable_fine_grained_param_gather_hook = (

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shjwudp Can we override this to be True when we have fp8_recipe as MX-FP8 and fp8_param_gather is True? Users might forget to set this for MX-FP8 and might face issues, so better to override right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that what they were already doing? See megatron/core/distributed/fsdp/mcore_fsdp_adapter.py.

@cspades cspades Apr 7, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I think @shjwudp interpreted your feedback as "let's just make it a manual toggle".

What you want is OR logic:

enable_fine_grained_param_gather_hook=config.megatron_fsdp_enable_fine_grained_param_gather or (
    config.fp8_recipe == "mxfp8" and ddp_config.fp8_param_gather
),

Also nit: can we make this argument and config name shorter: megatron_fsdp_enable_fine_grained_param_gather -> megatron_fsdp_fine_grained_param_ag

@rapatel rapatel Apr 7, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, when we run any [mxfp8+native fp8 model param] config without recompute, we'll always run with fine-grained AG. I thought that's what we don't want.

I thought the idea is that we don't want fine-grained AG, especially when we don't have recompute.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Ritesh is correct, lets not override it, but we should set it outside in M-LM training loop based on fp8_param, fp8_recipe and recompute modules present or not.

@cspades cspades Apr 7, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh shoot, that is a typo, I forgot to add the recompute part: args.recompute_granularity == 'selective'.

Then to summarize, if we are using MXFP8 and are using activation recompute, then fine-grained AG must be used. Otherwise, you can either toggle it, or just not set it at all.

# Optional: OR set it as an argument.
enable_fine_grained_param_gather_hook=(
    config.fp8_recipe == "mxfp8" and ddp_config.fp8_param_gather and args.recompute_granularity == 'selective'
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I'm not sure if there's a way to add what @cspades mentioned within MCore itself. But if there is, then it makes sense to have that as a heuristic.

@cspades cspades left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely could use more documentation as well, it's not obvious how fine-grained / non-recursive module AG is related to row-wise / col-wise parameters.

The Megatron-FSDP logic that figures out what parameters to gather, as well as the TransformerEngine logic for when row-wise or col-wise data exists, should be explained in the future.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Final Review PR is in the "final review" stage label Apr 7, 2026
@Phlip79 Phlip79 added the 26.06 label May 12, 2026
@Phlip79 Phlip79 requested a review from sanandaraj5597 May 14, 2026 19:49
@sanandaraj5597

Copy link
Copy Markdown
Contributor

LGTM.

@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the Final Review PR is in the "final review" stage label May 15, 2026
@shjwudp shjwudp enabled auto-merge May 15, 2026 05:41
@Phlip79

Phlip79 commented May 15, 2026

Copy link
Copy Markdown
Member

/ok to test 1cd7838

@Phlip79 Phlip79 requested review from a team and removed request for a team May 15, 2026 16:48
@Phlip79 Phlip79 requested a review from a team May 16, 2026 19:55
@shjwudp

shjwudp commented May 19, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test d91371d

@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the Final Review PR is in the "final review" stage label May 20, 2026
@shjwudp shjwudp added this pull request to the merge queue May 20, 2026
@Phlip79 Phlip79 added the Approved All necessary approvals have been made label May 20, 2026
@svcnvidia-nemo-ci

Copy link
Copy Markdown

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26204426356

@svcnvidia-nemo-ci

Copy link
Copy Markdown

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26204532704

@svcnvidia-nemo-ci

Copy link
Copy Markdown

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26204542460

Merged via the queue into NVIDIA:main with commit b5d143f May 21, 2026
79 of 82 checks passed
@shjwudp shjwudp deleted the mfsdp_opt_finegrained_ag branch May 21, 2026 06:18
janEbert pushed a commit to janEbert/Megatron-LM that referenced this pull request Jun 2, 2026
mathemakitten pushed a commit to mathemakitten/Megatron-LM that referenced this pull request Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants