Add named layouts to HyperCommGrid for heterogeneous parallelism by yashaswikarnati · Pull Request #5148 · NVIDIA/Megatron-LM

yashaswikarnati · 2026-06-04T04:24:02Z

What

Add named views to HyperCommGrid: one grid (a single rank span) can register extra factorizations beyond the implicit base view, then create / retrieve / enumerate process groups against any of them via a keyword-only view=.

register_view(name, shape, dim_names, shared_dims=None) — register and validate a named factorization over the same ranks.
create_pg / get_pg / get_rank_enum accept view= (defaults to base, so the single-view path is unchanged).
Base-view group keys are byte-for-byte unchanged; view-private groups use namespaced keys, and dims listed in shared_dims reuse the base group instead of duplicating it.

Why

Foundational for heterogeneous / non-colocated parallelism, where a dense (tp/cp/dp/pp) and an expert (expt_tp/ep/expt_dp/pp) factorization span the same ranks with different shapes. These are alternate tilings of one rank set rather than orthogonal axes, so they cannot be folded into a single cube. register_view models each as a separate factorization that must agree with the base view on any shared_dims.

How

_RankViewSpec holds each factorization; the base view is auto-registered from the constructor args.
Rank enumeration is generalized to any view's shape/dim_names via np.moveaxis + reshape (drops the einops dependency).
register_view proves each shared_dim enumerates identically to the base view, so shared groups are reused rather than rebuilt.
Robustness: destroy() skips groups this rank isn't a member of (NON_GROUP_MEMBER sentinel) and frees each shared group once; the rank-0 log is guarded by is_initialized().
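The enumeration and shared-dim check described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the code in the PR: `enumerate_groups` and `check_shared_dims` are hypothetical names, and the convention that the first-listed dim varies fastest across ranks is an assumption.

```python
import numpy as np


def enumerate_groups(shape, dim_names, dims, rank_offset=0):
    """Return one list of ranks per process group spanning `dims`."""
    shape = [int(s) for s in shape]
    size = int(np.prod(shape))  # derived from the passed shape, not from an instance
    # Assumption: the first-listed dim varies fastest, so the C-ordered
    # rank grid is built from the reversed shape.
    names_rev = list(dim_names)[::-1]
    grid = np.arange(rank_offset, rank_offset + size).reshape(shape[::-1])
    # Move the requested axes to the end, keeping their relative order.
    src = sorted(names_rev.index(d) for d in dims)
    dst = list(range(grid.ndim - len(src), grid.ndim))
    group_size = int(np.prod([grid.shape[a] for a in src]))
    return np.moveaxis(grid, src, dst).reshape(-1, group_size).tolist()


def check_shared_dims(base, view, shared_dims):
    """Raise if a shared dim enumerates differently in `view` than in `base`."""
    for dim in shared_dims:
        if enumerate_groups(*base, [dim]) != enumerate_groups(*view, [dim]):
            raise ValueError(f"shared dim {dim!r} does not match the base view")


# Two factorizations of the same 8 ranks (shapes chosen for illustration).
dense = ([4, 1, 1, 2], ["tp", "cp", "dp", "pp"])
expert = ([1, 2, 2, 2], ["expt_tp", "ep", "expt_dp", "pp"])

print(enumerate_groups(*dense, ["tp"]))   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(enumerate_groups(*expert, ["ep"]))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(enumerate_groups(*dense, ["pp"]))   # [[0, 4], [1, 5], [2, 6], [3, 7]]
check_shared_dims(dense, expert, ["pp"])  # passes: pp tiles the ranks identically
```

Under these assumptions the dense and expert views disagree on their private dims but produce the same pp groups, which is the condition that lets a shared group be reused rather than rebuilt.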

Fully backward compatible — all new params are optional and keyword-only. Covered by tests/unit_tests/test_hyper_comm_grid.py, including a real-distributed 8-GPU view test.

copy-pr-bot · 2026-06-04T04:24:05Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yashaswikarnati · 2026-06-04T05:05:50Z

Reworked per review — thanks for the careful read. Summary of changes:

Shared dims are reused, not duplicated. register_layout(..., shared_dims=["pp"]) validates that a shared dimension's rank enumeration matches the base layout's, and a group spanning only shared dims now returns the base grid's group object (same ranks). This honors the invariant that dense and expert pipeline groups must be identical (in parallel_state.py: decoder_rank_generator.get_ranks("pp") == expert_decoder_rank_generator.get_ranks("pp")). The previous "distinct expert:pp" behavior is gone.
No implicit layout inference. Removed base-precedence resolution. Layout-private groups are reachable only through an explicit GridLayout handle (grid.get_layout("expert").get_pg(...)); the base grid's create_pg/get_pg/get_rank_enum are unchanged and operate on the base layout only.
Smaller surface / internal keys. Dropped has_layout and the "<layout>:<dims>" string namespacing; layout-private groups are keyed by (layout_name, ordered_dims) tuples. Base-grid behavior is byte-for-byte unchanged for existing callers.
Real distributed coverage. Replaced the monkey-patched create_pg tests with a real 8-rank integration test (test_real_distributed_registered_layout) that registers base + expert layouts, creates groups in both, asserts actual rank membership of an expert-private group (with a real all_reduce), asserts the shared pp group is the same object/ranks as base, and that destroy() frees reused groups exactly once. Pure-Python register_layout validation tests are kept.
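The keying rule summarized above (tuple keys for layout-private groups, base keys for groups that span only shared dims) can be illustrated with a toy registry. Everything here is hypothetical: `canonical_key` and `get_or_create` are illustrative names, the "-"-joined base key format is an assumption, and `object()` stands in for a real process group.

```python
BASE = "base"


def canonical_key(layout, ordered_dims, shared_dims=()):
    """Key under which a group is stored (illustrative only)."""
    dims = tuple(ordered_dims)
    spans_only_shared = bool(dims) and all(d in shared_dims for d in dims)
    if layout == BASE or spans_only_shared:
        # Assumed base key format; base keys stay unchanged for existing callers.
        return "-".join(dims)
    return (layout, dims)


groups = {}


def get_or_create(layout, ordered_dims, shared_dims=()):
    key = canonical_key(layout, ordered_dims, shared_dims)
    if key not in groups:
        groups[key] = object()  # stand-in for a real process group
    return groups[key]


base_pp = get_or_create(BASE, ["pp"])
expert_pp = get_or_create("expert", ["pp"], shared_dims=["pp"])
expert_ep = get_or_create("expert", ["ep"], shared_dims=["pp"])

print(expert_pp is base_pp)  # True: the shared pp group is reused
print(sorted(map(str, groups)))  # ["('expert', ('ep',))", 'pp']
```

Canonicalizing a shared-dim request onto the base key means the registry never holds a second pp group, so there is nothing to duplicate at creation time.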

Validated on 1 node × 8 GPUs (torch.distributed.run --nproc-per-node 8): all green.

Allow a single HyperCommGrid (one rank span) to carry additional named factorizations beyond its base layout, so dense and expert parallel factorizations can share the same ranks with different shapes.

- register_layout(name, shape, dim_names, shared_dims=None) returns a GridLayout handle; get_layout(name) retrieves it. The handle exposes explicit create_pg / get_pg / get_rank_enum against that layout. The base grid's own methods are unchanged and operate only on the base layout (no implicit cross-layout inference).
- shared_dims declares dimensions that must coincide with the base layout (e.g. pipeline parallelism, which must span identical ranks for the dense and expert parts). Registration validates that a shared dimension's rank enumeration matches the base layout's, and a group spanning only shared dims reuses the base grid's group object rather than creating a duplicate.
- Layout-private groups are keyed by (layout_name, ordered_dims).

Also add two partial-participation robustness guards: skip the rank-0 log when torch.distributed is not initialized, and in destroy() only tear down groups this rank is a member of (deduping reused groups by identity so a shared group is not freed twice).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
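The destroy() guard described in this commit message can be sketched without torch. This is a minimal sketch under stated assumptions: `destroy_groups` is a hypothetical helper, the local `NON_GROUP_MEMBER` stands in for torch.distributed.GroupMember.NON_GROUP_MEMBER, and `destroy_fn` stands in for torch.distributed.destroy_process_group.

```python
# Stand-in sentinel marking a group the calling rank does not belong to.
NON_GROUP_MEMBER = object()


def destroy_groups(groups, destroy_fn):
    """Destroy each live group exactly once, then clear the registry."""
    seen_ids = set()
    for pg in list(groups.values()):
        if pg is None or pg is NON_GROUP_MEMBER:
            continue  # this rank never joined the group
        if id(pg) in seen_ids:
            continue  # shared group already freed under another key
        seen_ids.add(id(pg))
        destroy_fn(pg)
    groups.clear()


class FakeGroup:
    def __init__(self, name):
        self.name = name


pp, ep = FakeGroup("pp"), FakeGroup("ep")
registry = {
    "pp": pp,
    ("expert", ("pp",)): pp,  # same object reused for the shared dim
    ("expert", ("ep",)): ep,
    "tp": NON_GROUP_MEMBER,   # group this rank is not part of
}
freed = []
destroy_groups(registry, lambda pg: freed.append(pg.name))
print(freed)  # ['pp', 'ep']
```

Deduplicating by identity rather than by key keeps the teardown correct even if the same group object ends up registered under more than one key.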

- Make `view` keyword-only on create_pg/get_pg/get_rank_enum so a stray positional arg cannot silently bind to it.
- Accept numpy integer shape entries in register_view, matching the base grid constructor (isinstance check now uses numbers.Integral).
- Reject duplicate shared_dims with a clear ValueError instead of a cryptic numpy "repeated axis" error.
- Drive the rank-0 creation log off the canonical key so a shared-dim request canonicalized onto the base group is labelled as base, not view.
- Derive the enumeration size from the passed shape in _gen_rank_enum_for instead of self.size, removing an implicit instance coupling.
- Remove the now-dead _order_dims wrapper (no production callers); point its unit tests at _order_dims_for, collapsing three ordering helpers to two.

Add regression tests for the numpy-int shape and duplicate shared_dims cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
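The argument checks listed in this commit can be illustrated as follows. This is a sketch, not the PR's code: `validate_view_args` and the stub `get_rank_enum` are hypothetical, and only the `numbers.Integral` check, the duplicate shared_dims rejection, and the keyword-only `view` come from the commit message.

```python
import numbers

import numpy as np


def validate_view_args(shape, dim_names, shared_dims=None):
    """Validate registration arguments and return normalized copies."""
    if len(shape) != len(dim_names):
        raise ValueError("shape and dim_names must have the same length")
    for s in shape:
        # numbers.Integral covers Python ints and numpy integer scalars.
        if not isinstance(s, numbers.Integral) or s <= 0:
            raise ValueError(f"shape entries must be positive integers, got {s!r}")
    shared = list(shared_dims or [])
    if len(set(shared)) != len(shared):
        raise ValueError(f"duplicate entries in shared_dims: {shared}")
    unknown = [d for d in shared if d not in dim_names]
    if unknown:
        raise ValueError(f"shared_dims not present in dim_names: {unknown}")
    return [int(s) for s in shape], shared


def get_rank_enum(dims, *, view="base"):
    """Stub showing the keyword-only `view` parameter."""
    return view, tuple(dims)


shape, shared = validate_view_args(
    [np.int64(2), 2, np.int32(2)], ["ep", "expt_dp", "pp"], shared_dims=["pp"]
)
print(shape, shared)  # [2, 2, 2] ['pp']
print(get_rank_enum(["ep"], view="expert"))  # ('expert', ('ep',))
```

With the bare `*` in the signature, a call such as `get_rank_enum(["ep"], "expert")` raises a TypeError instead of silently binding the second positional argument to `view`.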

Merge test_register_view_stores_rank_view into the copy-semantics test (renamed test_register_view_success_stores_copied_metadata), which already covered the same registration path. The merged test now also asserts the None return value and the stored view name, so no coverage is lost.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Address review feedback: collapse the _is_process_group_member docstring to one line and condense the class-level views paragraph to two lines.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

yashaswikarnati · 2026-06-08T23:35:48Z

/ok to test 644aab2

yashaswikarnati · 2026-06-09T19:24:28Z

/ok to test dd90fc6

svcnvidia-nemo-ci · 2026-06-09T20:22:01Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/27233418142

yashaswikarnati added the Run tests label Jun 4, 2026

yashaswikarnati force-pushed the ykarnati/upstream-hypercommgrid-multilayout branch from 22c7543 to a17e366 Compare June 4, 2026 05:05

yashaswikarnati changed the title ~~Add multi-layout support to HyperCommGrid~~ Add named layouts to HyperCommGrid for heterogeneous parallelism Jun 4, 2026

yashaswikarnati commented Jun 4, 2026
