[refactor] Common combined-1F1B schedule-plan base (1/4 of #4798) by Connor-XY · Pull Request #4941 · NVIDIA/Megatron-LM

Connor-XY · 2026-05-22T17:05:59Z

What does this PR do?

Part 1 of 4 in splitting #4798 by @Wohox and @Connor-XY into smaller PRs, to reduce the number of CODEOWNERS reviewer groups per PR. Original changes are by @Wohox and @Connor-XY; this PR is purely the model-agnostic refactor, with no behavior change for GPT/MTP.

Summary

Move model-agnostic combined-1F1B schedule-plan helpers out of gpt/fine_grained_callables.py into megatron/core/models/common/ so non-GPT models (HybridStack, in part 2) can build the same schedule plans:

Add megatron/core/models/common/utils.py (houses _BackwardDWWrapper, previously in gpt/fine_grained_callables.py), common/fine_grained_callables.py, and common/model_chunk_schedule_plan.py with the shared abstractions.
Reduce gpt/fine_grained_callables.py to GPT-specific pieces; reuse the new common base classes.
Switch GraphableMegatronModule.init_backward_dw_wrapper to import _BackwardDWWrapper from common.utils.
Relax combined_forward_backward_step's GPTModel-only assert to a duck-type on build_schedule_plan so any model implementing it can participate in EP overlap.
Carry over the f6ea23b fix from #4798 ([feat] Hybrid model ep overlapping main): apply the main decoder's final_norm in MTP pre-dispatch when a VPP chunk has no main HybridStack layers. The HybridModel reference is a deferred import inside the function, so this file remains importable without the hybrid package.
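The relaxed check in combined_forward_backward_step can be pictured with a minimal sketch. The helper name `require_schedule_plan_support` and the stand-in model classes are hypothetical, and the real check may differ in detail; the point is only that any object exposing a callable `build_schedule_plan` passes, where previously only `GPTModel` instances did.

```python
from typing import Any


def require_schedule_plan_support(model: Any) -> None:
    """Hypothetical sketch of a duck-typed capability check.

    Instead of `assert isinstance(model, GPTModel)`, accept any model that
    exposes a callable `build_schedule_plan` method.
    """
    build_schedule_plan = getattr(model, "build_schedule_plan", None)
    assert callable(build_schedule_plan), (
        f"{type(model).__name__} does not implement build_schedule_plan; "
        "it cannot participate in combined-1F1B EP overlap"
    )


# Hypothetical stand-ins: none of these classes exist in Megatron-LM.
class GPTLikeModel:
    def build_schedule_plan(self, *args, **kwargs):
        return "gpt plan"


class HybridLikeModel:
    def build_schedule_plan(self, *args, **kwargs):
        return "hybrid plan"


class PlainModel:
    pass


require_schedule_plan_support(GPTLikeModel())     # passes
require_schedule_plan_support(HybridLikeModel())  # also passes; no GPTModel base class needed
# require_schedule_plan_support(PlainModel())     # would raise AssertionError
```

Checking for the capability rather than the class is what lets a non-GPT model (HybridStack, in part 2) opt in by implementing one method, without inheriting from GPTModel.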

Why this slice

Touches 5 reviewer groups: core-adlr, core-nemo, gpt, transformer, pipeline-parallelism. The hybrid- / FSDP- / training-specific reviewers are not needed here.

Dependencies

None — this PR is mergeable on its own. Parts 2/3/4 depend on this PR.

Validation

The full integrated change was validated in #4798 (unit tests, DeepSeek-V3 deterministic Transformer/Hybrid baselines, and EP-overlap smoke tests). This slice is a refactor with no behavior change for the GPT/MTP path. For the carried-over f6ea23b fix, @Wohox confirmed bitwise-identical behavior on an 8-node EOS run (lm_loss bitwise-identical at iteration 1 with A2A_OVERLAP=1 vs. 0).
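As a small illustration of what "bitwise-identical" means for a scalar loss (as opposed to approximately equal), the hypothetical helper below compares the raw IEEE-754 bits of two Python floats, such as lm_loss values read back from the A2A_OVERLAP=1 and A2A_OVERLAP=0 runs. It is a sketch, not the check used in the validation above; for whole tensors, `torch.equal` is the usual exact-equality check.

```python
import struct


def bitwise_identical(a: float, b: float) -> bool:
    """Return True if two Python floats have the same IEEE-754 bit pattern.

    Stricter than `a == b`: it tells 0.0 apart from -0.0, and it treats two
    NaNs with the same bit pattern as identical (`==` is always False for NaN).
    """
    return struct.pack("<d", a) == struct.pack("<d", b)


print(bitwise_identical(1.5, 1.5))   # True
print(bitwise_identical(0.0, -0.0))  # False, although 0.0 == -0.0 is True
```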

Issue tracking

Linked issue: part of #4798.

Pre-checks

Carried-over commits already validated upstream in [feat] Hybrid model ep overlapping main #4798
Refactor preserves GPT/MTP behavior
Each split PR preserves or credits the original authors' commit messages

🤖 Generated with Claude Code

copy-pr-bot · 2026-05-22T17:06:03Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

wujingyue

I'll review after expert reviews are done.

wujingyue · 2026-05-28T21:55:43Z

But thanks for breaking down your PR and thanks for the refactor!

Connor-XY · 2026-06-01T18:05:29Z

/claude review

Connor-XY · 2026-06-01T19:46:20Z

/claude review

Connor-XY · 2026-06-02T00:55:47Z

/claude review

claude

LGTM

Connor-XY · 2026-06-02T01:09:59Z

/ok to test c1c3585

@Wohox

Move model-agnostic schedule-plan helpers out of gpt/fine_grained_callables.py into megatron/core/models/common/ so non-GPT models (HybridStack) can build the same combined-1F1B / EP-overlap schedule plans. No behavior change for GPT/MTP.

- Add megatron/core/models/common/{utils.py, fine_grained_callables.py, model_chunk_schedule_plan.py} with the shared abstractions.
- Reduce megatron/core/models/gpt/fine_grained_callables.py to GPT-specific pieces; reuse the new common base classes.
- Switch GraphableMegatronModule.init_backward_dw_wrapper to import _BackwardDWWrapper from the new common module.
- Relax combined_forward_backward_step's GPTModel-only assert to a duck-type on build_schedule_plan so any model implementing it can participate.

Part 1/4 of splitting NVIDIA#4798 (original changes by @Wohox).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@Wohox

…unk)

Carries over upstream commit f6ea23b from NVIDIA#4798: when a hybrid layer pattern places MTP in a post_process VPP chunk that holds no main HybridStack layers (e.g. a trailing pipe before the MTP separator), the EP-overlap schedule never invokes ``_maybe_apply_final_norm`` on the main path, so the unnormalized hidden_states feed straight into the LM head and lm_loss diverges by ~10x.

Fix in ``submodule_mtp_pre_dispatch_forward``: run the main decoder's ``final_norm`` just before ``torch.chunk``, gated on ``len(model.decoder.layers) == 0`` and ``isinstance(model, HybridModel)``. The HybridModel import is deferred inside the function so this file does not gain a module-level dependency on the hybrid package, and PR 1 remains independently importable.

Part 1/4 of splitting NVIDIA#4798 (original changes by @Wohox).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
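The gating described in this commit can be sketched in plain Python. Everything here is hypothetical: `mtp_pre_dispatch_sketch`, the `is_hybrid_model` predicate, the `SimpleNamespace` stand-ins and the `model.decoder.layers` / `model.decoder.final_norm` attribute layout are assumptions chosen to mirror the commit message, and lists replace tensors. The real fix lives in `submodule_mtp_pre_dispatch_forward` and performs the `HybridModel` isinstance check through an import deferred inside the function; the sketch injects that check as a predicate so it runs without Megatron installed.

```python
from types import SimpleNamespace
from typing import Callable, List


def mtp_pre_dispatch_sketch(
    model,
    hidden_states: List[float],
    num_chunks: int,
    is_hybrid_model: Callable[[object], bool],
) -> List[List[float]]:
    """Hypothetical sketch of the f6ea23b gating, with lists standing in for tensors."""
    # A VPP chunk with no main-decoder layers never ran the main path's final
    # norm, so apply it here before splitting. In this sketch the length check
    # runs first, so the hybrid predicate is only consulted for such chunks.
    if len(model.decoder.layers) == 0 and is_hybrid_model(model):
        hidden_states = model.decoder.final_norm(hidden_states)
    if not hidden_states:
        return []
    # Stand-in for torch.chunk: contiguous pieces of size ceil(n / num_chunks).
    size = -(-len(hidden_states) // num_chunks)
    return [hidden_states[i : i + size] for i in range(0, len(hidden_states), size)]


def toy_final_norm(xs: List[float]) -> List[float]:
    # Toy stand-in for a normalization layer: just halves every value.
    return [x / 2 for x in xs]


# A post_process VPP chunk that holds MTP but no main-decoder layers.
mtp_only_chunk = SimpleNamespace(decoder=SimpleNamespace(layers=[], final_norm=toy_final_norm))
print(mtp_pre_dispatch_sketch(mtp_only_chunk, [2.0, 4.0, 6.0, 8.0], 2, lambda m: True))
# [[1.0, 2.0], [3.0, 4.0]]
```

Placing the `HybridModel` import inside the function body means Python resolves it only when that statement executes, not when the module is first imported, which is what keeps this file loadable without the hybrid package.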

Connor-XY · 2026-06-03T18:49:14Z

/ok to test eb4c325

Connor-XY · 2026-06-04T00:02:45Z

/ok to test 21484bc

This was referenced May 22, 2026

[feat] HybridStack grouped syntax + checkpoint compat + EP-overlap (2/4 of #4798) #4942 (Draft)

[feat] FSDP support for HybridStack EP-overlap (3/4 of #4798) #4943 (Draft)

[fix] Training wire-up for hybrid EP-overlap (4/4 of #4798) #4944 (Draft)

Phlip79 marked this pull request as ready for review May 28, 2026 00:36

Phlip79 requested review from a team as code owners May 28, 2026 00:36

svcnvidia-nemo-ci requested a review from a team May 28, 2026 00:36

svcnvidia-nemo-ci added the complexity: high label May 28, 2026

wujingyue reviewed May 28, 2026


Connor-XY force-pushed the pr4798-1-common-schedule-refactor branch 2 times, most recently from 0784750 to d65a251 on June 1, 2026 18:05

claude Bot reviewed Jun 1, 2026


Comment thread megatron/core/models/common/fine_grained_callables.py Outdated

claude Bot reviewed Jun 1, 2026


Comment thread megatron/core/models/common/utils.py Outdated

Connor-XY force-pushed the pr4798-1-common-schedule-refactor branch from d65a251 to d163ed8 on June 1, 2026 18:46

claude Bot reviewed Jun 1, 2026


Comment thread megatron/core/models/common/utils.py Outdated

claude Bot reviewed Jun 1, 2026


Comment thread megatron/core/pipeline_parallel/combined_1f1b.py Outdated

Connor-XY force-pushed the pr4798-1-common-schedule-refactor branch from d163ed8 to 22539e6 on June 1, 2026 20:03

claude Bot approved these changes Jun 2, 2026


Connor-XY force-pushed the pr4798-1-common-schedule-refactor branch from 22539e6 to c1c3585 on June 2, 2026 01:08

Connor-XY added the Run tests label Jun 2, 2026

copy-pr-bot Bot temporarily deployed to public June 2, 2026 01:10 Inactive

Connor-XY force-pushed the pr4798-1-common-schedule-refactor branch 3 times, most recently from b32b394 to 123d858 on June 3, 2026 16:51

Connor-XY and others added 5 commits June 3, 2026 11:48

fix: address PR1 review comments

89e85d6

fix: address follow-up PR1 comments

c151346

fix: address Keshav review comments

eb4c325

Connor-XY force-pushed the pr4798-1-common-schedule-refactor branch from 123d858 to eb4c325 on June 3, 2026 18:48

copy-pr-bot Bot temporarily deployed to public June 3, 2026 18:50 Inactive

copy-pr-bot Bot temporarily deployed to public June 3, 2026 18:53 Inactive

copy-pr-bot Bot temporarily deployed to public June 3, 2026 19:01 Inactive

fix: apply black formatting to MTP unit test

21484bc

copy-pr-bot Bot temporarily deployed to public June 4, 2026 00:03 Inactive

copy-pr-bot Bot temporarily deployed to test June 4, 2026 00:03 Inactive

copy-pr-bot Bot temporarily deployed to public June 4, 2026 00:06 Inactive

copy-pr-bot Bot temporarily deployed to public June 4, 2026 00:07 Inactive

copy-pr-bot Bot temporarily deployed to public June 4, 2026 00:15 Inactive

Phlip79 requested a review from santhnm2 June 8, 2026 16:48

Phlip79 approved these changes Jun 11, 2026


santhnm2 approved these changes Jun 11, 2026


Connor-XY wants to merge 6 commits into NVIDIA:main from Connor-XY:pr4798-1-common-schedule-refactor


5 participants
