Add MIMO hetero topology + distributed bootstrap (examples/mimo training-loop folder) #5260
2d2bc6e to 75e2d04
Review comment resolutions
Mapping each review comment to its resolution in the latest revision (
Validation
8-GPU real-distributed run on cw-dfw:
75e2d04 to a2902bb
Round-2 review comments addressed
Docstrings (4 trimmed for concision while preserving architectural meaning):
Expert dims (
Verified on an 8-GPU real-distributed run: all 8 tests in
dfab655 to c37c93c
c37c93c to a212d18
Add a production training-loop folder examples/mimo/training/ with two modules ported from the hetero prototype and cleaned to production quality:

- topology.py: builds per-module HyperCommGrid(s) from a layout-general ModuleGridSpec (rank offsets come from the spec, not hardcoded), adapts each grid into a ProcessGroupCollection, assembles a MultiModuleProcessGroupCollection, and creates the language embedding groups. Uses the on-main named-view API (register_view with shared_dims) for the expert factorization, routes colocated/non-colocated detection through RankRole.build, and validates the invariant that grids either tile the world disjointly XOR fully share ranks. Ships the non-colocated configuration without precluding a colocated one.
- distributed.py: torch.distributed + global-memory-buffer bootstrap that does not call mpu.initialize_model_parallel and asserts that the parallel_state model-parallel globals are uninitialized.

Add an 8-GPU real-distributed unit test asserting that the two grids partition the world, that the per-module PGC group sizes match the factorization, and that validation rejects an overlapping-but-not-equal layout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
a212d18 to c82f09f
/ok to test c82f09f
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/27311132496
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/27312935448
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/27313542518
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/27315861359
…ing-loop folder) (NVIDIA#5260) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds per-module HyperCommGrid topology and a distributed bootstrap for hetero MIMO training under examples/mimo/training/, building each module's process groups (TP/CP/PP/DP, expert views, and language embedding groups) and packaging them into a MultiModuleProcessGroupCollection.
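To make "rank offsets come from the spec" concrete, here is a minimal pure-Python sketch of how a module's rank offset and parallelism shape could determine its rank groups. It does not use or mirror the real HyperCommGrid or ModuleGridSpec API: `GridSketch`, its fields, the example shapes, and the "first listed dimension varies fastest" ordering are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import product
from math import prod


@dataclass(frozen=True)
class GridSketch:
    """Hypothetical stand-in for one module's grid (not the real HyperCommGrid)."""

    rank_offset: int        # first global rank owned by this module (from the spec)
    shape: dict[str, int]   # dim name -> size; ASSUMPTION: first dim varies fastest

    @property
    def size(self) -> int:
        return prod(self.shape.values())

    def ranks(self) -> range:
        # Contiguous block of global ranks starting at the spec-provided offset.
        return range(self.rank_offset, self.rank_offset + self.size)

    def groups(self, dim: str) -> list[list[int]]:
        """Rank groups that vary along `dim` while every other dim is held fixed."""
        names = list(self.shape)
        # Stride of each dimension in the flattened, offset-relative rank order.
        strides, stride = {}, 1
        for name in names:
            strides[name] = stride
            stride *= self.shape[name]
        others = [n for n in names if n != dim]
        out = []
        for coords in product(*(range(self.shape[n]) for n in others)):
            base = self.rank_offset + sum(c * strides[n] for c, n in zip(coords, others))
            out.append([base + i * strides[dim] for i in range(self.shape[dim])])
        return sorted(out)


# Two hypothetical modules tiling an 8-rank world without overlap.
encoder = GridSketch(rank_offset=0, shape={"tp": 2, "dp": 2})
language = GridSketch(rank_offset=4, shape={"tp": 2, "pp": 2})
```

In this sketch every group along a dimension has exactly that dimension's size, and the groups of one module never contain ranks outside its offset range, which is the kind of property the group-size test in this PR checks against the real grids.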
Why
Enables MIMO to run on the stock megatron/training loop via an explicit ProcessGroupCollection rather than parallel_state globals. Process groups are owned by HyperCommGrid, not parallel_state. The default None preserves current behavior; there are no changes to megatron/core and no homogeneous/non-MIMO behavior changes.
Testing
Real cog 8-GPU distributed unit test (no mocks), run name mm1-topology-test: 4 passed (test_grids_partition_world, test_pgc_group_sizes, test_validate_rejects_overlapping_not_equal, test_validate_rejects_gap_in_world_coverage). The homogeneous goldens remain the next CI gate.
Validation now enforces that module grids partition the world [0, world_size) with no gaps in addition to the pairwise-disjoint-XOR-fully-shared invariant.
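The two checks described above can be sketched in plain Python as follows. This is an illustrative reconstruction of the invariant only, not the code in topology.py: the function name `validate_grids`, its signature, the returned labels, and the error messages are hypothetical.

```python
from itertools import combinations


def validate_grids(grids: dict[str, range], world_size: int) -> str:
    """Return "non-colocated" or "colocated"; raise ValueError on an invalid layout.

    Hypothetical helper: `grids` maps a module name to the global ranks it owns.
    """
    rank_sets = {name: set(ranks) for name, ranks in grids.items()}
    if not rank_sets or any(not s for s in rank_sets.values()):
        raise ValueError("every module grid must own at least one rank")

    pairs = list(combinations(rank_sets.values(), 2))
    all_disjoint = all(a.isdisjoint(b) for a, b in pairs)
    all_shared = all(a == b for a, b in pairs)

    # Invariant 1: grids are pairwise disjoint XOR fully shared.
    # Overlapping-but-not-equal layouts satisfy neither condition.
    if not (all_disjoint or all_shared):
        raise ValueError("grids must be pairwise disjoint or fully shared")

    # Invariant 2: together the grids cover exactly [0, world_size), with no gaps
    # and no ranks outside the world.
    covered = set().union(*rank_sets.values())
    if covered != set(range(world_size)):
        missing = sorted(set(range(world_size)) - covered)
        extra = sorted(covered - set(range(world_size)))
        raise ValueError(f"grids do not cover the world: missing={missing}, extra={extra}")

    # A single grid has no pairs; it is reported as non-colocated here by convention.
    return "non-colocated" if all_disjoint else "colocated"
```

For example, `validate_grids({"encoder": range(0, 4), "language": range(4, 8)}, 8)` is accepted as a disjoint tiling, while `{"encoder": range(0, 5), "language": range(4, 8)}` is rejected for overlapping without being equal and `{"encoder": range(0, 3), "language": range(4, 8)}` is rejected because rank 3 is uncovered.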
Stacking
Standalone on origin/main (no stacking).
🤖 Generated with Claude Code