M-FSDP: Make fine_grained_param_gather configurable for MXFP8 to enable performance–memory trade-offs by shjwudp · Pull Request #4181 · NVIDIA/Megatron-LM

shjwudp · 2026-04-07T15:48:41Z

What does this PR do ?

This change updates the MXFP8 implementation so that fine_grained_param_gather, previously always enabled, is now exposed as a configurable option.

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Contribution process

Pre-checks

I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code Typing guidelines
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

When your PR is ready, click Ready for Review.
An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
- Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

copy-pr-bot · 2026-04-07T15:48:46Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-04-07T15:48:56Z

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

Add the oncall reviewer (optional reviewer)
Add required review teams based on your changes

See the contribution guide for more details.

sanandaraj5597 · 2026-04-07T16:38:42Z

        self.calculate_per_token_loss = calculate_per_token_loss
        self.init_model_with_meta_device = init_model_with_meta_device
-        self.enable_fine_grained_param_gather_hook = enable_fine_grained_param_gather_hook
+        self.enable_fine_grained_param_gather_hook = (


@shjwudp Can we override this to be True when we have fp8_recipe as MX-FP8 and fp8_param_gather is True? Users might forget to set this for MX-FP8 and might face issues, so better to override right?

Isn't that what they were already doing? See megatron/core/distributed/fsdp/mcore_fsdp_adapter.py.

Yes, but I think @shjwudp interpreted your feedback as "let's just make it a manual toggle".

What you want is OR logic:

enable_fine_grained_param_gather_hook=config.megatron_fsdp_enable_fine_grained_param_gather or ( config.fp8_recipe == "mxfp8" and ddp_config.fp8_param_gather ),

Also nit: can we make this argument and config name shorter: megatron_fsdp_enable_fine_grained_param_gather -> megatron_fsdp_fine_grained_param_ag

In that case, when we run any [mxfp8+native fp8 model param] config without recompute, we'll always run with fine-grained AG. I thought that's what we don't want.

I thought the idea is that we don't want fine-grained AG, especially when we don't have recompute.

I think Ritesh is correct, lets not override it, but we should set it outside in M-LM training loop based on fp8_param, fp8_recipe and recompute modules present or not.

Oh shoot, that is a typo, I forgot to add the recompute part: args.recompute_granularity == 'selective'.

Then to summarize, if we are using MXFP8 and are using activation recompute, then fine-grained AG must be used. Otherwise, you can either toggle it, or just not set it at all.

# Optional: OR set it as an argument. enable_fine_grained_param_gather_hook=( config.fp8_recipe == "mxfp8" and ddp_config.fp8_param_gather and args.recompute_granularity == 'selective' )

This looks good to me. I'm not sure if there's a way to add what @cspades mentioned within MCore itself. But if there is, then it makes sense to have that as a heuristic.

cspades

Definitely could use more documentation as well, it's not obvious how fine-grained / non-recursive module AG is related to row-wise / col-wise parameters.

The Megatron-FSDP logic that figures out what parameters to gather, as well as the TransformerEngine logic for when row-wise or col-wise data exists, should be explained in the future.

sanandaraj5597 · 2026-05-15T03:24:11Z

LGTM.

Phlip79 · 2026-05-15T16:47:21Z

/ok to test 1cd7838

shjwudp · 2026-05-19T00:47:54Z

/ok to test d91371d

svcnvidia-nemo-ci · 2026-05-21T03:54:11Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26204426356

svcnvidia-nemo-ci · 2026-05-21T03:57:44Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26204532704

svcnvidia-nemo-ci · 2026-05-21T03:58:04Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26204542460

…le performance–memory trade-offs (NVIDIA#4181)

add --megatron-fsdp-enable-fine-grained-param-gather option

1cd7838

shjwudp requested review from a team as code owners April 7, 2026 15:48

svcnvidia-nemo-ci marked this pull request as draft April 7, 2026 15:48

shjwudp marked this pull request as ready for review April 7, 2026 15:49

shjwudp changed the title ~~Make fine_grained_param_gather configurable for MXFP8 to enable performance–memory trade-offs~~ M-FSDP: Make fine_grained_param_gather configurable for MXFP8 to enable performance–memory trade-offs Apr 7, 2026

svcnvidia-nemo-ci requested a review from a team April 7, 2026 15:49

shjwudp requested review from cspades and sanandaraj5597 and removed request for a team April 7, 2026 15:49

shjwudp self-assigned this Apr 7, 2026

shjwudp added the module: megatron-fsdp label Apr 7, 2026

svcnvidia-nemo-ci added the complexity: low label Apr 7, 2026

sanandaraj5597 reviewed Apr 7, 2026

View reviewed changes

cspades approved these changes Apr 7, 2026

View reviewed changes

svcnvidia-nemo-ci added the Final Review PR is in the "final review" stage label Apr 7, 2026

Phlip79 added the 26.06 label May 12, 2026

erhoo82 added the nemotron label May 13, 2026

Phlip79 requested a review from sanandaraj5597 May 14, 2026 19:49

sanandaraj5597 approved these changes May 15, 2026

View reviewed changes

svcnvidia-nemo-ci removed the Final Review PR is in the "final review" stage label May 15, 2026

shjwudp enabled auto-merge May 15, 2026 05:41

copy-pr-bot Bot temporarily deployed to test May 15, 2026 16:48 Inactive

Phlip79 requested review from a team and removed request for a team May 15, 2026 16:48

Phlip79 requested a review from a team May 16, 2026 19:55

deepakn94 approved these changes May 18, 2026

View reviewed changes

Merge remote-tracking branch 'nvidia/main' into mfsdp_opt_finegrained_ag

d91371d

copy-pr-bot Bot temporarily deployed to public May 19, 2026 00:48 Inactive

copy-pr-bot Bot temporarily deployed to test May 19, 2026 00:48 Inactive

copy-pr-bot Bot temporarily deployed to public May 19, 2026 00:51 Inactive

copy-pr-bot Bot temporarily deployed to public May 19, 2026 00:52 Inactive

copy-pr-bot Bot temporarily deployed to public May 19, 2026 00:59 Inactive

Phlip79 added the Final Review PR is in the "final review" stage label May 19, 2026

ericharper approved these changes May 20, 2026

View reviewed changes

svcnvidia-nemo-ci removed the Final Review PR is in the "final review" stage label May 20, 2026

shjwudp added this pull request to the merge queue May 20, 2026

Phlip79 added the Approved All necessary approvals have been made label May 20, 2026

Merged via the queue into NVIDIA:main with commit b5d143f May 21, 2026
79 of 82 checks passed

shjwudp deleted the mfsdp_opt_finegrained_ag branch May 21, 2026 06:18

sbhavani mentioned this pull request May 26, 2026

[ROADMAP][2026 Q2] Megatron Core Roadmap #4997

Open

janEbert pushed a commit to janEbert/Megatron-LM that referenced this pull request Jun 2, 2026

M-FSDP: Make fine_grained_param_gather configurable for MXFP8 to enab…

5d37281

…le performance–memory trade-offs (NVIDIA#4181)

mathemakitten pushed a commit to mathemakitten/Megatron-LM that referenced this pull request Jun 12, 2026

M-FSDP: Make fine_grained_param_gather configurable for MXFP8 to enab…

7cc4537

…le performance–memory trade-offs (NVIDIA#4181)

Conversation

shjwudp commented Apr 7, 2026

What does this PR do ?

Contribution process

Pre-checks

Code review

Step 1: Mark PR as "Ready for Review"

Step 2: Final Review

Step 3: Approved

Merge

Uh oh!

copy-pr-bot Bot commented Apr 7, 2026

Uh oh!

github-actions Bot commented Apr 7, 2026

Uh oh!

sanandaraj5597 Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

rapatel Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

cspades Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rapatel Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sanandaraj5597 Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

cspades Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rapatel Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

cspades left a comment

Choose a reason for hiding this comment

Uh oh!

sanandaraj5597 commented May 15, 2026

Uh oh!

Phlip79 commented May 15, 2026

Uh oh!

shjwudp commented May 19, 2026

Uh oh!

svcnvidia-nemo-ci commented May 21, 2026

Uh oh!

svcnvidia-nemo-ci commented May 21, 2026

Uh oh!

svcnvidia-nemo-ci commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

cspades Apr 7, 2026 •

edited

Loading

rapatel Apr 7, 2026 •

edited

Loading

cspades Apr 7, 2026 •

edited

Loading