Add MIMO hetero topology + distributed bootstrap (examples/mimo training-loop folder) by yashaswikarnati · Pull Request #5260 · NVIDIA/Megatron-LM

yashaswikarnati · 2026-06-10T05:16:21Z

Adds per-module HyperCommGrid topology and a distributed bootstrap for hetero MIMO training under examples/mimo/training/, building each module's process groups (TP/CP/PP/DP, expert views, and language embedding groups) and packaging them into a MultiModuleProcessGroupCollection.

Why

Enables MIMO to run on the stock megatron/training loop via an explicit ProcessGroupCollection rather than parallel_state globals. Process groups are owned by HyperCommGrid, not parallel_state. The default None preserves current behavior; there are no changes to megatron/core and no homogeneous/non-MIMO behavior changes.

Testing

Real cog 8-GPU distributed unit test (no mocks), run name mm1-topology-test: 4 passed (test_grids_partition_world, test_pgc_group_sizes, test_validate_rejects_overlapping_not_equal, test_validate_rejects_gap_in_world_coverage). The homogeneous goldens remain the next CI gate.

Validation now enforces that module grids partition the world [0, world_size) with no gaps, in addition to the pairwise-disjoint-XOR-fully-shared invariant.
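For readers who want the two invariants spelled out, here is a minimal pure-Python sketch. It is not the code in topology.py: the function name `validate_world_tiling` and the `(rank_offset, size)` tuple representation are assumptions made for illustration only. It checks that every pair of grids is either disjoint or identical, and that the union of all grids covers [0, world_size) exactly.

```python
from itertools import combinations


def validate_world_tiling(grids: dict[str, tuple[int, int]], world_size: int) -> None:
    """Validate hypothetical per-module grids given as name -> (rank_offset, size).

    Raises ValueError if two grids overlap without being identical, or if the
    union of all grids does not cover exactly the ranks [0, world_size).
    """
    rank_sets = {
        name: frozenset(range(offset, offset + size))
        for name, (offset, size) in grids.items()
    }

    # Pairwise invariant: two grids are either disjoint or share every rank.
    for (name_a, ranks_a), (name_b, ranks_b) in combinations(rank_sets.items(), 2):
        if ranks_a != ranks_b and ranks_a & ranks_b:
            raise ValueError(f"grids {name_a!r} and {name_b!r} overlap but are not equal")

    # Coverage invariant: no gaps and no ranks outside the world.
    covered = frozenset().union(*rank_sets.values())
    world = frozenset(range(world_size))
    if covered != world:
        missing = sorted(world - covered)
        extra = sorted(covered - world)
        raise ValueError(f"grids do not tile the world: missing={missing}, extra={extra}")


# Non-colocated layout: two disjoint grids that together cover 8 ranks.
validate_world_tiling({"encoder": (0, 4), "language": (4, 4)}, world_size=8)
# Colocated layout: both grids share all 8 ranks.
validate_world_tiling({"encoder": (0, 8), "language": (0, 8)}, world_size=8)
```

The two rejection cases named in the test list (overlapping-but-not-equal and gap-in-coverage) correspond to the two `ValueError` branches in this sketch.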

Stacking

Standalone on origin/main (no stacking).

🤖 Generated with Claude Code

copy-pr-bot · 2026-06-10T05:16:24Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yashaswikarnati · 2026-06-10T06:40:20Z

Review comment resolutions

Mapping each review comment to its resolution in the latest revision (75e2d04):

Distributed init kept (why stock cannot replace it): distributed.py retains a torch.distributed + global-memory-buffer bootstrap. Stock mpu.initialize_model_parallel cannot be used because it materializes the model-parallel globals for a single homogeneous world tiling; the hetero MIMO topology needs per-module grids with disjoint rank offsets, so we bring up torch.distributed WITHOUT initializing MPU and assert the MPU globals stay uninitialized.
Assert simplified: assert_parallel_state_uninitialized is now a compact loop over the model-parallel globals (DP/TP/PP/CP/embd/pos_embd) rather than a call to model_parallel_is_initialized, so a partial/leaked init is caught.
print_rank_0 reused: now imported from megatron.training.utils instead of redefined locally.
specs[-1] replaced: the language module is now selected via the explicit is_language_module flag on the spec; create_topology validates exactly one spec sets it.
language_module_name + side dict deleted: removed; the layout is driven by is_language_module / RankRole / ModuleLayout.
Embedding groups in PGC: language embedding groups now live in PGC.embd / PGC.pos_embd, built collectively via parallel_state.default_embedding_ranks / default_position_embedding_ranks as real ProcessGroups (encoder modules get None).
is_current_rank_in_grid filter kept + explained: retained to scope per-module construction to ranks actually in the grid; commented to explain why.
GroupMember sentinel replaced: membership is now checked via get_rank(group) >= 0 (handles -1 for non-members) instead of the GroupMember sentinel.
Validation preserved: the world-tiling invariant (grids tile the world disjointly XOR fully share ranks) is preserved and exercised by the test, including rejection of overlapping-but-not-equal and gap-in-coverage layouts.
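As an illustration of the "specs[-1] replaced" item above, the following sketch shows one way to select the language module through an explicit flag and reject layouts where zero or several specs set it. The names `SpecSketch` and `select_language_spec` are hypothetical stand-ins, not the actual ModuleGridSpec or create_topology code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecSketch:
    """Hypothetical stand-in for a module spec; only the fields needed here."""

    name: str
    is_language_module: bool = False


def select_language_spec(specs: list[SpecSketch]) -> SpecSketch:
    """Return the single spec flagged as the language module.

    Position in the list is irrelevant, unlike a specs[-1] convention.
    """
    flagged = [spec for spec in specs if spec.is_language_module]
    if len(flagged) != 1:
        names = [spec.name for spec in flagged]
        raise ValueError(f"expected exactly one language module, got {len(flagged)}: {names}")
    return flagged[0]


specs = [
    SpecSketch("language", is_language_module=True),
    SpecSketch("vision_encoder"),
]
print(select_language_spec(specs).name)
```

Selecting by flag rather than by list position means reordering the specs cannot silently change which module is treated as the language model.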

Validation

8-GPU real-distributed run on cw-dfw: tests/unit_tests/test_mimo_hetero_topology.py — 5 passed (no skips):
test_grids_partition_world, test_pgc_group_sizes, test_embedding_groups, test_validate_rejects_overlapping_not_equal, test_validate_rejects_gap_in_world_coverage.

yashaswikarnati · 2026-06-10T16:28:12Z

Round-2 review comments addressed

Docstrings (4 trimmed for concision while preserving architectural meaning):

distributed.py module docstring and initialize_distributed collapsed to a one-line summary plus a one-line justification.
topology.py module docstring and _build_language_embedding_groups trimmed; the essential invariants (per-module grids, embedding-group construction) are kept.

Expert dims (expt_tp / expt_dp):

These are now resolved to concrete ints and validated inside ModuleGridSpec.__post_init__ (no None sentinels left to flow downstream). expt_tp defaults to tp; expt_dp is derived as size // (expt_tp * ep * pp), with divisibility and product checks (expt_tp * ep * expt_dp * pp == size) enforced at construction.
_build_grid is simplified to read these already-resolved concrete values directly, with no None-fallback branching.

dp kept explicit (rationale):

The module-view size invariant is size = tp * cp * pp * dp, so dp is a first-class field of the module grid. expt_dp is its expert-view analog and is derived in __post_init__ from the expert factorization; keeping dp explicit mirrors the two distinct views (module vs. expert) rather than overloading one field.
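The expert-dimension resolution and the module-view size invariant described above can be sketched with a small dataclass. This is an illustrative reconstruction from the formulas in this comment, not the actual ModuleGridSpec: the class name `GridSpecSketch`, the default values, and modelling `size` as a derived property are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GridSpecSketch:
    """Hypothetical spec: resolves expert dims to concrete ints at construction."""

    tp: int = 1
    cp: int = 1
    pp: int = 1
    dp: int = 1
    ep: int = 1
    expt_tp: Optional[int] = None  # resolved to tp when omitted
    expt_dp: Optional[int] = None  # derived from the expert factorization when omitted

    @property
    def size(self) -> int:
        # Module-view invariant: size = tp * cp * pp * dp.
        return self.tp * self.cp * self.pp * self.dp

    def __post_init__(self) -> None:
        if self.expt_tp is None:
            self.expt_tp = self.tp
        denom = self.expt_tp * self.ep * self.pp
        if self.expt_dp is None:
            # Divisibility check before deriving expt_dp.
            if self.size % denom != 0:
                raise ValueError(f"size {self.size} not divisible by expt_tp*ep*pp = {denom}")
            self.expt_dp = self.size // denom
        # Product check: the expert view must cover the same ranks as the module view.
        if self.expt_tp * self.ep * self.expt_dp * self.pp != self.size:
            raise ValueError("expt_tp * ep * expt_dp * pp must equal size")


implicit = GridSpecSketch(tp=2, dp=4, ep=4)             # expt_tp -> 2, expt_dp -> 1
explicit = GridSpecSketch(tp=2, dp=4, ep=4, expt_tp=1)  # expt_dp -> 2
print(implicit.expt_tp, implicit.expt_dp, explicit.expt_dp)
```

Because resolution happens in `__post_init__`, downstream code in this sketch can read `expt_tp` and `expt_dp` as plain ints with no None-fallback branching, which matches the simplification described for _build_grid.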

Verified on an 8-GPU real-distributed run: all 8 tests in tests/unit_tests/test_mimo_hetero_topology.py pass (including new TestModuleGridSpecResolution cases for implicit/explicit expert resolution and invalid factorization).

Add a production training-loop folder examples/mimo/training/ with two modules ported from the hetero prototype and cleaned to production quality:

- topology.py: builds per-module HyperCommGrid(s) from a layout-general ModuleGridSpec (rank offsets come from the spec, not hardcoded), adapts each grid into a ProcessGroupCollection, assembles a MultiModuleProcessGroupCollection, and creates the language embedding groups. Uses the on-main named-view API (register_view with shared_dims) for the expert factorization, routes colocated/non-colocated detection through RankRole.build, and validates the invariant that grids either tile the world disjointly XOR fully share ranks. Ships the non-colocated configuration without precluding a colocated one.
- distributed.py: torch.distributed + global-memory-buffer bootstrap that does not call mpu.initialize_model_parallel and asserts the parallel_state model-parallel globals are uninitialized.

Add an 8-GPU real-distributed unit test asserting that the two grids partition the world, the per-module PGC group sizes match the factorization, and validation rejects an overlapping-but-not-equal layout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

yashaswikarnati · 2026-06-10T21:15:53Z

/ok to test c82f09f

svcnvidia-nemo-ci · 2026-06-10T22:43:43Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/27311132496

svcnvidia-nemo-ci · 2026-06-10T23:25:57Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/27312935448

svcnvidia-nemo-ci · 2026-06-10T23:41:04Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/27313542518

svcnvidia-nemo-ci · 2026-06-11T00:39:31Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/27315861359

…ing-loop folder) (NVIDIA#5260) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>