Support multimodule pipelining in 1F1B schedule by yashaswikarnati · Pull Request #3129 · NVIDIA/Megatron-LM

yashaswikarnati · 2026-01-28T22:53:53Z

Summary

Adds support for multi-module pipeline parallelism (encoder + LLM) in the 1F1B schedule.

Changes:

- Add `MultiModuleProcessGroupCollection` for managing process groups across modules
- Support dict-based tensor format `{module_name: tensor}` in forward/backward
- Handle 2D/3D tensor conversion for P2P and bridge communication
- Add `backward_step_multimodule` to handle backward for multimodule cases
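
To make the dict-based format concrete, here is a minimal sketch of how a helper could treat a bare tensor and a `{module_name: tensor}` dict uniformly. The helper `map_tensors` and the module names `"encoder"` and `"llm"` are hypothetical and are not part of Megatron-LM; plain Python lists stand in for tensors so the snippet runs without any dependencies.

```python
from typing import Any, Callable, Dict, Union

# A "tensor" here is any object; real code would use torch.Tensor.
TensorOrDict = Union[Any, Dict[str, Any]]


def map_tensors(fn: Callable[[Any], Any], value: TensorOrDict) -> TensorOrDict:
    """Apply fn to a single tensor, or to every tensor in a per-module dict.

    Hypothetical helper: it only illustrates the {module_name: tensor} shape.
    """
    if isinstance(value, dict):
        # Multimodule case: keep the module names, transform each tensor.
        return {name: fn(tensor) for name, tensor in value.items()}
    # Single-module case: the value is the tensor itself.
    return fn(value)


def double(tensor):
    """Stand-in for a per-tensor operation such as a send or a reshape."""
    return [2 * x for x in tensor]


single_out = map_tensors(double, [1.0, 2.0])
multi_out = map_tensors(double, {"encoder": [1.0, 2.0], "llm": [3.0]})
```

With a helper of this shape, the single-module path keeps passing bare tensors while the multimodule path passes dicts, and both go through the same call sites.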

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

- I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
- I have added relevant unit tests
- I have added relevant functional tests
- I have added proper typing to my code (see the Typing guidelines)
- I have added relevant documentation
- I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label `Expert Review`

(Step 2): Collect the expert reviewers' reviews

Attach the Expert Review label when your PR is ready for review.
GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

Add Final Review label
GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

- Rename ProcessGroupCollectionWrapper to MultiModuleProcessGroupCollection
- Rename language_model field to language_model_module_name for clarity
- Add language_model_module_name param to backward_step_multimodule
- Use functools.partial to bind param, keeping signature consistent
- Add type hints to _ensure_3d_tensor and _restore_tensor_shape
- Move is_multimodule check earlier for validation and backward selection
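
The `functools.partial` item can be sketched as follows. The function bodies, argument lists, and the `select_backward_fn` helper are assumptions made for illustration, not the signatures used in `schedules.py`; only the idea of binding `language_model_module_name` ahead of time, so the schedule calls one consistent signature, comes from the commit message.

```python
from functools import partial


def backward_step(input_tensor, output_tensor, output_grad):
    """Stand-in for the single-module backward step (hypothetical body)."""
    return {"mode": "single", "grad": output_grad}


def backward_step_multimodule(
    input_tensor, output_tensor, output_grad, *, language_model_module_name
):
    """Stand-in for the multimodule backward step (hypothetical body).

    It needs one extra piece of information: which module is the language model.
    """
    return {
        "mode": "multimodule",
        "language_model": language_model_module_name,
        "modules": sorted(output_grad),
    }


def select_backward_fn(is_multimodule, language_model_module_name=None):
    """Pick the backward function once, before the schedule loop starts."""
    if is_multimodule:
        # Bind the extra keyword now so callers keep the three-argument form.
        return partial(
            backward_step_multimodule,
            language_model_module_name=language_model_module_name,
        )
    return backward_step


backward_fn = select_backward_fn(True, "llm")
grads = {"encoder": 0.5, "llm": 1.5}
result = backward_fn(grads, grads, grads)
```

Because the extra argument is bound up front, the rest of the schedule does not need to branch on whether it is running a single module or several.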

copy-pr-bot · 2026-01-28T22:53:57Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

dimapihtar · 2026-01-29T21:28:18Z

/ok to test 2d7c176

shifangx · 2026-03-13T08:30:43Z

/ok to test 6542743

Replace num_warmup_microbatches property on P2PCommunicator and MultiModulePipelineCommunicator with total_stages and current_stage properties. Compute num_warmup_microbatches in schedules.py instead. Addresses review feedback from jaredcasper on PR NVIDIA#3129. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
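
For context, the warmup count in a non-interleaved 1F1B schedule is conventionally derived from the stage position alone, which is why exposing `total_stages` and `current_stage` is enough. The sketch below uses that conventional rule; the free function and its argument names are illustrative, and whether the multimodule path in this PR applies exactly this formula is an assumption.

```python
def compute_num_warmup_microbatches(
    total_stages: int, current_stage: int, num_microbatches: int
) -> int:
    """Conventional 1F1B warmup count for one pipeline stage.

    Earlier stages run more forward-only microbatches before entering the
    steady one-forward-one-backward phase; the last stage runs none.
    """
    if not 0 <= current_stage < total_stages:
        raise ValueError("current_stage must be in [0, total_stages)")
    # Never warm up with more microbatches than actually exist.
    return min(total_stages - current_stage - 1, num_microbatches)


first_stage_warmup = compute_num_warmup_microbatches(4, 0, 8)
last_stage_warmup = compute_num_warmup_microbatches(4, 3, 8)
few_microbatches_warmup = compute_num_warmup_microbatches(4, 0, 2)
```

Keeping this arithmetic in the schedule means the communicators only report where a rank sits in the pipeline and do not encode any scheduling policy.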

# Conflicts: # megatron/core/pipeline_parallel/schedules.py

… tests
- Add HyperCommGrid.destroy() and BridgeCommunicator.destroy_broadcast_pgs() to clean up PGs created during tests
- Add expt_dp grid dimension and cache embedding PGs to prevent creation of undestroyed PGs in DDP init and add_embedding_groups
- Reuse pg_collection across finalize_model_grads calls instead of rebuilding from scratch each iteration
- Add teardown_method to bridge/communicator/schedules test classes
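
The cleanup pattern described here can be sketched with pytest's `teardown_method` hook. `FakeGrid` is a hypothetical stand-in that only mimics an object owning process groups and exposing `destroy()`; it is not `HyperCommGrid`, and the real test classes may be organized differently.

```python
class FakeGrid:
    """Hypothetical stand-in for an object that owns process groups."""

    def __init__(self):
        self.groups = []

    def create_pg(self, name):
        # Real code would create a torch.distributed process group here.
        self.groups.append(name)
        return name

    def destroy(self):
        # Release every group this grid created.
        self.groups.clear()


class TestScheduleSketch:
    def setup_method(self, method=None):
        # Track every grid created by a test so teardown can find it.
        self.grids = []

    def _make_grid(self):
        grid = FakeGrid()
        self.grids.append(grid)
        return grid

    def teardown_method(self, method=None):
        # pytest calls this after each test method, even when the test fails.
        for grid in self.grids:
            grid.destroy()
        self.grids.clear()

    def test_creates_groups(self):
        grid = self._make_grid()
        grid.create_pg("tp")
        assert grid.groups == ["tp"]
```

Registering each grid at creation time lets one teardown hook clean up after every test, so groups are not left behind between tests.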

yashaswikarnati · 2026-03-16T20:37:21Z

/ok to test edc8159

embd, pos_embd, pp, dp_cp are already validated in finalize_model_grads. Only tp and cp are directly used in the schedule functions.

yashaswikarnati · 2026-03-16T21:11:40Z

/ok to test 92b65d1

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

yashaswikarnati · 2026-03-17T22:03:42Z

/ok to test 78ee58c

svcnvidia-nemo-ci · 2026-03-18T05:04:03Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23229957578

…edule (NVIDIA#3129)
New files:
- tests/unit_tests/pipeline_parallel/test_multimodule_schedules.py

These test files import from existing modules that are modified in Phase 2:
- test_rmsnorm_residual_fusion.py: imports TEFusedResidualRMSNorm (added in NVIDIA#3384)
- test_mup.py: imports get_mup_config_overrides (added in NVIDIA#3058)
- test_multimodule_schedules.py: imports MultiModuleProcessGroupCollection (added in NVIDIA#3129)

They will be re-added in Phase 2 when the corresponding code changes land.

Made-with: Cursor

Resolve merge conflicts in 8 files after syncing fork with upstream (including merged #3129). Take main's improvements for multimodule communicator, p2p_communication, process_groups_config, and schedule tests. Keep both non-colocated PP changes and main's new features. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: ykarnati <ykarnati@nvidia.com>
Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
Co-authored-by: Shifang Xu <shifangx@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

yashaswikarnati added 7 commits January 27, 2026 11:48

add pp stage checkers to p2p communicator

c601de4

add process group collection wrapper

84ae4f0

support multimodule pipelining in 1f1b schedule

0fa3dd8

fix dim mapping in torch cat bridge comm

b22f638

handle 3d 2d tensor conversion in multimodule comm

3badf57

add unit tests for multimodule pipeline schedules

20d03f5

yashaswikarnati requested review from a team as code owners January 28, 2026 22:53

ko3n1g requested a review from a team January 28, 2026 22:54

yashaswikarnati and others added 3 commits January 28, 2026 15:25

rename module_collections to module_pgs for clarity

b102eb7

rename tensor conversion functions for clarity

ebbb509

Merge branch 'main' into yash/1f1b_changes

2d7c176

dimapihtar added the complexity: high and Expert Review labels Jan 29, 2026

copy-pr-bot Bot temporarily deployed to nemo-ci January 29, 2026 21:28 Inactive

ko3n1g added this to the Core 0.16 milestone Jan 29, 2026

copy-pr-bot Bot had a problem deploying to nemo-ci January 29, 2026 21:28 Failure

copy-pr-bot Bot temporarily deployed to nemo-ci January 29, 2026 21:28 Inactive

yaoyu-33 reviewed Feb 2, 2026

Comment thread megatron/core/pipeline_parallel/bridge_communicator.py

yashaswikarnati mentioned this pull request Feb 2, 2026

Add multi-module heterogeneous parallelism support for MIMO model #3211

Merged

6 tasks

Fix linting issues: format code and remove unused imports

0b6cefd

yaoyu-33 reviewed Feb 4, 2026

Comment thread megatron/core/pipeline_parallel/multimodule_communicator.py

yaoyu-33 reviewed Feb 4, 2026

Comment thread megatron/core/pipeline_parallel/schedules.py

dimapihtar requested a review from erhoo82 February 4, 2026 15:10

Merge branch 'main' into yash/1f1b_changes

597862e

svcnvidia-nemo-ci added the Final Review label Mar 13, 2026

copy-pr-bot Bot temporarily deployed to test March 13, 2026 08:31 Inactive

shifangx enabled auto-merge March 13, 2026 08:32

yashaswikarnati and others added 3 commits March 13, 2026 17:42

Merge remote-tracking branch 'upstream/main' into yash/1f1b_changes

738db94

copy-pr-bot Bot temporarily deployed to test March 16, 2026 20:38 Inactive

Remove redundant pg_collection asserts from schedules.py

92b65d1

yashaswikarnati force-pushed the yash/1f1b_changes branch from 2ad399d to 92b65d1 March 16, 2026 20:59

copy-pr-bot Bot temporarily deployed to test March 16, 2026 21:14 Inactive

Add missing copyright header to test_bridge_communicator.py

78ee58c

copy-pr-bot Bot temporarily deployed to test March 17, 2026 22:04 Inactive

jaredcasper approved these changes Mar 17, 2026

svcnvidia-nemo-ci added the Approved label and removed the Final Review label Mar 17, 2026

shifangx added this pull request to the merge queue Mar 18, 2026

Merged via the queue into NVIDIA:main with commit 0ca9b63 Mar 18, 2026
55 of 57 checks passed

ilml added a commit to ilml/Megatron-LM that referenced this pull request Mar 20, 2026

Add new files from 0ca9b63 Support multimodule pipelining in 1F1B sch…

fdd847c

…edule (NVIDIA#3129)
New files:
- tests/unit_tests/pipeline_parallel/test_multimodule_schedules.py

This was referenced Apr 13, 2026

support for training qwen3 vl with dist train NVIDIA-NeMo/Megatron-Bridge#2367

Merged

[draft]DistTrain demo for mutil module training #1995

Closed

sbhavani mentioned this pull request May 26, 2026

[ROADMAP][2026 Q2] Megatron Core Roadmap #4997

Open

Support multimodule pipelining in 1F1B schedule #3129
shifangx merged 24 commits into NVIDIA:main from yashaswikarnati:yash/1f1b_changes

10 participants
