Add named layouts to HyperCommGrid for heterogeneous parallelism #5148
Merged
yashaswikarnati merged 6 commits on Jun 9, 2026
Reworked per review — thanks for the careful read. Summary of changes:
Validated on 1 node × 8 GPUs ( |
yashaswikarnati
commented
Jun 4, 2026
Allow a single HyperCommGrid (one rank span) to carry additional named factorizations beyond its base layout, so dense and expert parallel factorizations can share the same ranks with different shapes.

- register_layout(name, shape, dim_names, shared_dims=None) returns a GridLayout handle; get_layout(name) retrieves it. The handle exposes explicit create_pg / get_pg / get_rank_enum against that layout. The base grid's own methods are unchanged and operate only on the base layout (no implicit cross-layout inference).
- shared_dims declares dimensions that must coincide with the base layout (e.g. pipeline parallelism, which must span identical ranks for the dense and expert parts). Registration validates that a shared dimension's rank enumeration matches the base layout's, and a group spanning only shared dims reuses the base grid's group object rather than creating a duplicate.
- Layout-private groups are keyed by (layout_name, ordered_dims).

Also add two partial-participation robustness guards: skip the rank-0 log when torch.distributed is not initialized, and in destroy() only tear down groups this rank is a member of (deduping reused groups by identity so a shared group is not freed twice).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
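The destroy() guard in this commit message can be illustrated with a minimal sketch. It does not use torch.distributed: `FakeGroup`, `destroy_groups`, `is_member`, and `destroy_fn` are hypothetical names invented here, and storing a reused shared group under two keys is an assumption of the sketch, not a statement about how the grid stores its groups.

```python
class FakeGroup:
    """Hypothetical stand-in for a process group handle."""

    def __init__(self, ranks):
        self.ranks = ranks


def destroy_groups(groups, is_member, destroy_fn):
    """Destroy each distinct group object once, skipping non-member groups.

    `groups` maps (layout_name, ordered_dims) keys to group objects.
    Returns the keys whose group was actually destroyed.
    """
    seen = set()
    destroyed = []
    for key, group in groups.items():
        if id(group) in seen:
            continue  # same object reached through another key (a reused shared group)
        seen.add(id(group))
        if not is_member(group):
            continue  # this rank never joined the group, so there is nothing to free
        destroy_fn(group)
        destroyed.append(key)
    groups.clear()
    return destroyed


# Rank 0's view: the pp group is shared by both layouts, the ep group excludes rank 0.
pp = FakeGroup([0, 4])
tp = FakeGroup([0, 1])
ep_other = FakeGroup([4, 5, 6, 7])
groups = {
    ("base", ("pp",)): pp,
    ("base", ("tp",)): tp,
    ("expert", ("pp",)): pp,
    ("expert", ("ep",)): ep_other,
}
calls = []
destroyed = destroy_groups(groups, lambda g: 0 in g.ranks, calls.append)
print(destroyed)  # [('base', ('pp',)), ('base', ('tp',))]
```

Deduping on `id()` rather than on the key is what keeps the shared group from being freed twice in this sketch, since two keys can point at one object.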
yashaswikarnati
commented
Jun 5, 2026
- Make `view` keyword-only on create_pg/get_pg/get_rank_enum so a stray positional arg cannot silently bind to it.
- Accept numpy integer shape entries in register_view, matching the base grid constructor (isinstance check now uses numbers.Integral).
- Reject duplicate shared_dims with a clear ValueError instead of a cryptic numpy "repeated axis" error.
- Drive the rank-0 creation log off the canonical key so a shared-dim request canonicalized onto the base group is labelled as base, not view.
- Derive the enumeration size from the passed shape in _gen_rank_enum_for instead of self.size, removing an implicit instance coupling.
- Remove the now-dead _order_dims wrapper (no production callers); point its unit tests at _order_dims_for, collapsing three ordering helpers to two.

Add regression tests for the numpy-int shape and duplicate shared_dims cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
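A rough sketch of the first three bullets, for readers who want to see the checks in isolation. `validate_view_args` and `lookup` are hypothetical names, and the error messages are invented; only the use of `numbers.Integral`, the `ValueError` on duplicates, and the keyword-only `view` come from the commit message.

```python
import numbers


def validate_view_args(shape, dim_names, shared_dims=None):
    """Hypothetical stand-in for the argument checks described above."""
    if len(shape) != len(dim_names):
        raise ValueError(f"shape {shape} and dim_names {dim_names} differ in length")
    for entry in shape:
        # numbers.Integral matches Python ints and NumPy integer scalars,
        # which NumPy registers with this abstract base class.
        if not isinstance(entry, numbers.Integral) or entry < 1:
            raise ValueError(f"shape entries must be positive integers, got {entry!r}")
    shared = list(shared_dims or [])
    duplicates = sorted({d for d in shared if shared.count(d) > 1})
    if duplicates:
        # Fail early with a readable message instead of a downstream NumPy axis error.
        raise ValueError(f"duplicate shared_dims: {duplicates}")
    unknown = [d for d in shared if d not in dim_names]
    if unknown:
        raise ValueError(f"shared_dims not in dim_names: {unknown}")
    return shared


def lookup(dims, *, view="base"):
    """Keyword-only `view`: a stray positional argument raises TypeError."""
    return (view, tuple(dims))


ok = validate_view_args([4, 1, 2], ["ep", "expt_dp", "pp"], shared_dims=["pp"])

try:
    validate_view_args([4, 1, 2], ["ep", "expt_dp", "pp"], shared_dims=["pp", "pp"])
except ValueError as err:
    duplicate_error = str(err)

try:
    lookup(["pp"], "expert")  # positional value cannot bind to `view`
except TypeError:
    positional_rejected = True
```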
Merge test_register_view_stores_rank_view into the copy-semantics test (renamed test_register_view_success_stores_copied_metadata), which already covered the same registration path. The merged test now also asserts the None return value and the stored view name, so no coverage is lost. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
yashaswikarnati
commented
Jun 8, 2026
Address review feedback: collapse the _is_process_group_member docstring to one line and condense the class-level views paragraph to two lines. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
deepakn94
approved these changes
Jun 8, 2026
yaoyu-33
approved these changes
Jun 8, 2026
/ok to test 644aab2
/ok to test dd90fc6
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/27233418142
What
Add named views to `HyperCommGrid`: one grid (a single rank span) can register extra factorizations beyond the implicit `base` view, then create / retrieve / enumerate process groups against any of them via a keyword-only `view=`.
- `register_view(name, shape, dim_names, shared_dims=None)`: register and validate a named factorization over the same ranks.
- `create_pg` / `get_pg` / `get_rank_enum` accept `view=` (defaults to `base`, so the single-view path is unchanged).
- Groups that span only `shared_dims` reuse the base group instead of duplicating it.

Why
Foundational for heterogeneous / non-colocated parallelism, where a dense (`tp/cp/dp/pp`) and an expert (`expt_tp/ep/expt_dp/pp`) factorization span the same ranks with different shapes. These are alternate tilings of one rank set, not orthogonal axes, so they can't be a single cube. `register_view` models each as a separate factorization that must agree on any `shared_dims`.

How
- `_RankViewSpec` holds each factorization; the base view is auto-registered from the constructor args.
- Rank enumeration is computed from `shape` / `dim_names` via `np.moveaxis` + reshape (drops the einops dependency).
- `register_view` proves each `shared_dim` enumerates identically to the base view, so shared groups are reused rather than rebuilt.
- `destroy()` skips groups this rank isn't a member of (`NON_GROUP_MEMBER` sentinel) and frees each shared group once; the rank-0 log is guarded by `is_initialized()`.

Fully backward compatible: all new params are optional and keyword-only. Covered by `tests/unit_tests/test_hyper_comm_grid.py`, including a real-distributed 8-GPU view test.
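To make the "alternate tilings" idea concrete, here is a minimal sketch of enumerating groups with `np.moveaxis` + reshape and checking that a shared dimension enumerates identically in two views. It is not the code from this PR: `rank_enum` is an illustrative helper, the shapes are made up, and the rank ordering (first-listed dimension varies fastest) is an assumption of the sketch.

```python
import numpy as np


def rank_enum(shape, dim_names, group_dims, rank_offset=0):
    """List the ranks of every group spanning `group_dims` (illustrative helper).

    Assumption: the first entry of `dim_names` varies fastest across ranks.
    """
    size = int(np.prod(shape))
    # Reverse shape and names so the first-listed dim becomes the innermost C-order axis.
    grid = np.arange(rank_offset, rank_offset + size).reshape(tuple(reversed(shape)))
    names = list(reversed(dim_names))
    # Move the grouped axes to the end, preserving their relative order.
    src = [i for i, name in enumerate(names) if name in group_dims]
    dst = list(range(grid.ndim - len(src), grid.ndim))
    grid = np.moveaxis(grid, src, dst)
    group_size = int(np.prod([shape[dim_names.index(d)] for d in group_dims]))
    return grid.reshape(-1, group_size).tolist()


# Two tilings of the same 8 ranks (shapes chosen only for illustration).
dense = ([2, 2, 2], ["tp", "dp", "pp"])
expert = ([4, 1, 2], ["ep", "expt_dp", "pp"])

dense_pp = rank_enum(*dense, ["pp"])
expert_pp = rank_enum(*expert, ["pp"])

# A shared dim must enumerate identically in both views.
assert dense_pp == expert_pp == [[0, 4], [1, 5], [2, 6], [3, 7]]

# View-private dims differ, so they need their own groups.
print(rank_enum(*dense, ["tp"]))   # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(rank_enum(*expert, ["ep"]))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

In this example the `pp` groups coincide, so one group object could serve both views, while `tp` and `ep` produce different rank sets and cannot be expressed as axes of a single cube.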