Skip to content

[refactor] Common combined-1F1B schedule-plan base (1/4 of #4798)#4941

Open
Connor-XY wants to merge 6 commits into
NVIDIA:mainfrom
Connor-XY:pr4798-1-common-schedule-refactor
Open

[refactor] Common combined-1F1B schedule-plan base (1/4 of #4798)#4941
Connor-XY wants to merge 6 commits into
NVIDIA:mainfrom
Connor-XY:pr4798-1-common-schedule-refactor

Conversation

@Connor-XY

@Connor-XY Connor-XY commented May 22, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Part 1 of 4 splitting #4798 by @Wohox and @Connor-XY into smaller PRs to reduce the number of CODEOWNERS reviewer groups per PR. Original changes by @Wohox and @Connor-XY; this PR is purely the model-agnostic refactor with no behavior change for GPT/MTP.

Summary

Move model-agnostic combined-1F1B schedule-plan helpers out of gpt/fine_grained_callables.py into megatron/core/models/common/ so non-GPT models (HybridStack, in part 2) can build the same schedule plans:

  • Add megatron/core/models/common/utils.py (houses _BackwardDWWrapper, previously in gpt/fine_grained_callables.py), common/fine_grained_callables.py, and common/model_chunk_schedule_plan.py with the shared abstractions.
  • Reduce gpt/fine_grained_callables.py to GPT-specific pieces; reuse the new common base classes.
  • Switch GraphableMegatronModule.init_backward_dw_wrapper to import _BackwardDWWrapper from common.utils.
  • Relax combined_forward_backward_step's GPTModel-only assert to a duck-type on build_schedule_plan so any model implementing it can participate in EP overlap.
  • Carry the f6ea23b fix from [feat] Hybrid model ep overlapping main #4798: apply the main decoder's final_norm in MTP pre-dispatch when a VPP chunk has no main HybridStack layers (the HybridModel reference is a deferred import inside the function, so this file remains importable without the hybrid package).

Why this slice

Touches 5 reviewer groups: core-adlr, core-nemo, gpt, transformer, pipeline-parallelism. The hybrid- / FSDP- / training-specific reviewers are not needed here.

Dependencies

None — this PR is mergeable on its own. Parts 2/3/4 depend on this PR.

Validation

The full integrated change was validated in #4798 (unit tests + DeepSeek-V3 deterministic Transformer/Hybrid baseline + EP-overlap smoke tests). This slice is a refactor with no behavior change for the GPT/MTP path; behavior was confirmed bitwise-identical by @Wohox on EOS 8-node for the carried-over f6ea23b fix (lm_loss bitwise-identical at iter 1 with A2A_OVERLAP=1 vs 0).

Issue tracking

Linked issue: part of #4798.

Pre-checks

🤖 Generated with Claude Code

@copy-pr-bot

copy-pr-bot Bot commented May 22, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@wujingyue wujingyue left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll review after expert reviews are done.

@wujingyue

Copy link
Copy Markdown
Contributor

But thanks for breaking down your PR and thanks for the refactor!

@Connor-XY Connor-XY force-pushed the pr4798-1-common-schedule-refactor branch 2 times, most recently from 0784750 to d65a251 Compare June 1, 2026 18:05
@Connor-XY

Copy link
Copy Markdown
Contributor Author

/claude review

Comment thread megatron/core/models/common/fine_grained_callables.py Outdated
Comment thread megatron/core/models/common/utils.py Outdated
@Connor-XY Connor-XY force-pushed the pr4798-1-common-schedule-refactor branch from d65a251 to d163ed8 Compare June 1, 2026 18:46
@Connor-XY

Copy link
Copy Markdown
Contributor Author

/claude review

Comment thread megatron/core/models/common/utils.py Outdated
Comment thread megatron/core/pipeline_parallel/combined_1f1b.py Outdated
@Connor-XY Connor-XY force-pushed the pr4798-1-common-schedule-refactor branch from d163ed8 to 22539e6 Compare June 1, 2026 20:03
@Connor-XY

Copy link
Copy Markdown
Contributor Author

/claude review

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Connor-XY Connor-XY force-pushed the pr4798-1-common-schedule-refactor branch from 22539e6 to c1c3585 Compare June 2, 2026 01:08
@Connor-XY

Copy link
Copy Markdown
Contributor Author

/ok to test c1c3585

@Connor-XY Connor-XY force-pushed the pr4798-1-common-schedule-refactor branch 3 times, most recently from b32b394 to 123d858 Compare June 3, 2026 16:51
Connor-XY and others added 5 commits June 3, 2026 11:48
Move model-agnostic schedule-plan helpers out of gpt/fine_grained_callables.py
into megatron/core/models/common/ so non-GPT models (HybridStack) can build the
same combined-1F1B / EP-overlap schedule plans. No behavior change for GPT/MTP.

- Add megatron/core/models/common/{utils.py, fine_grained_callables.py,
  model_chunk_schedule_plan.py} with the shared abstractions.
- Reduce megatron/core/models/gpt/fine_grained_callables.py to GPT-specific
  pieces; reuse the new common base classes.
- Switch GraphableMegatronModule.init_backward_dw_wrapper to import
  _BackwardDWWrapper from the new common module.
- Relax combined_forward_backward_step's GPTModel-only assert to a duck-type
  on build_schedule_plan so any model implementing it can participate.

Part 1/4 of splitting NVIDIA#4798 (original changes by @Wohox).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…unk)

Carries over upstream commit f6ea23b from NVIDIA#4798: when a hybrid layer
pattern places MTP in a post_process VPP chunk that holds no main
HybridStack layers (e.g. trailing pipe before the MTP separator), the
EP-overlap schedule never invokes ``_maybe_apply_final_norm`` on the main
path, so the unnormalized hidden_states feed straight into the LM head and
lm_loss diverges by ~10x.

Fix in ``submodule_mtp_pre_dispatch_forward``: run the main decoder's
``final_norm`` just before ``torch.chunk``, gated on
``len(model.decoder.layers) == 0`` and ``isinstance(model, HybridModel)``.
The HybridModel import is deferred inside the function so this file does
not gain a module-level dependency on the hybrid package — PR 1 remains
independently importable.

Part 1/4 of splitting NVIDIA#4798 (original changes by @Wohox).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Connor-XY Connor-XY force-pushed the pr4798-1-common-schedule-refactor branch from 123d858 to eb4c325 Compare June 3, 2026 18:48
@Connor-XY

Copy link
Copy Markdown
Contributor Author

/ok to test eb4c325

@Connor-XY

Copy link
Copy Markdown
Contributor Author

/ok to test 21484bc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants