[Split 1/N of #3430] feat(mHC): basic pytorch implementation of manifold hyper connection by Connor-XY · Pull Request #4531 · NVIDIA/Megatron-LM

Connor-XY · 2026-04-29T18:41:25Z

What does this PR do?

This is the first of a sequence of split PRs derived from #3430
(mirror of upstream #2943 + kernel-fusion #3828). The goal of splitting is
to keep each PR small and limit the number of CODEOWNERS reviewer groups
required, making review faster and easier. Subsequent splits will be
opened only after this one merges — strictly one open at a time.

This PR contains only the core Manifold Hyper Connection (mHC) module
and its directly-supporting changes:

megatron/core/transformer/hyper_connection.py (new mHC module)
megatron/core/transformer/transformer_block.py (mHC integration into the block)
megatron/core/transformer/transformer_layer.py (HyperConnectionTransformerLayer + recompute hooks)
megatron/core/transformer/transformer_config.py (new mHC config fields)
megatron/core/transformer/__init__.py (re-exports of mHC symbols)
megatron/core/transformer/cuda_graphs.py (copyright-year bump only)
megatron/core/tensor_parallel/random.py (CheckpointWithoutOutput / CheckpointManager refactor used by mHC recompute)
megatron/core/fusions/fused_bias_dropout.py (extends get_bias_dropout_add with an optional mhc_recompute_manager path)

Reviewer groups touched: core-adlr, core-nemo, transformer, cuda-graphs.

Deferred to follow-up split PRs (NOT in this PR)

GPT model wiring — gpt_layer_specs.py, experimental specs, gpt_builders.py, training/initialize.py, model/fp8 tests
Pipeline-parallel support — pipeline_parallel/schedules.py, test_pp_mhc_compatibility.py
Fused mHC cuTile kernels — the #3828 kernel side (fused_mhc_kernels.py, kernel-related fields in transformer_config.py, kernel-using paths in hyper_connection.py, test_fused_mhc_kernels.py)
Functional-test recipe — tests/functional_tests/test_cases/gpt/gpt3_mcore_te_tp2_pp2_mhc/*, tests/test_utils/recipes/h100/gpt.yaml

Unit-test coverage for components in this PR

Every component added or modified by this PR has a corresponding unit test
included in this same PR:

Component	Unit test (in this PR)
`hyper_connection.py` (new module)	`tests/unit_tests/transformer/test_hyper_connection_recompute.py`, `tests/unit_tests/transformer/test_mhc_block_manager.py`, `tests/unit_tests/transformer/test_transformer_layer.py` (HyperConnectionTransformerLayer cases)
`transformer_layer.py` (mHC variant + recompute hooks)	`tests/unit_tests/transformer/test_transformer_layer.py`
`transformer_block.py` (mHC integration)	`tests/unit_tests/transformer/test_mhc_block_manager.py`, `tests/unit_tests/transformer/test_transformer_layer.py`
`transformer_config.py` (new mHC fields)	exercised by configs in `tests/unit_tests/transformer/test_transformer_layer.py`
`tensor_parallel/random.py` (`CheckpointWithoutOutput` + `ckpt_manager`)	`tests/unit_tests/transformer/test_hyper_connection_recompute.py`
`fusions/fused_bias_dropout.py` (`mhc_recompute_manager` path)	exercised indirectly via the mHC MLP/recompute path in `tests/unit_tests/transformer/test_hyper_connection_recompute.py` and `tests/unit_tests/transformer/test_transformer_layer.py`
`transformer/__init__.py`	export-only (no dedicated test — symbols are imported and exercised by the tests above)
`transformer/cuda_graphs.py`	copyright-year bump only (no test required)

The functional test (golden-values + model_config) for the end-to-end mHC
configuration lives in the deferred functional-test split PR, since that
file is @NVIDIA/ci-owned and orthogonal to the transformer reviewers.

Credit

All changes here are a re-slice of @jingqiny-99's work in #3430
(upstream basic pytorch implementation #2943). No new functionality has
been added, removed, or rewritten in this split — only repackaged for
narrower review scope.

Pre-checks

Re-slice of an existing PR; no new feature
Original unit tests included for all components touched
Functional test deferred to a later split PR (per split plan)
PR opened as draft per repo contributing guidelines

copy-pr-bot · 2026-04-29T18:41:29Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Connor-XY · 2026-04-29T19:03:25Z

/claude review

claude

Review — Split 1/N mHC core module

The core mHC implementation, CheckpointManager, _save_args_to_ctx/_load_args_from_ctx refactor, and the HyperConnectionTransformerLayer all look correct. The recompute plan in TransformerBlock, Sinkhorn-Knopp backward via recomputation, and the unified gradient hook design are clean.

One critical issue: all new test classes in test_transformer_layer.py call get_gpt_layer_with_transformer_engine_spec(enable_hyper_connection=True), but gpt_layer_specs.py is not modified in this PR and does not accept that parameter. This will cause every new test class (TestTransformerLayerWithHyperConnectionRecompute, TestMHCRecomputeMemorySaving, TestMHCWithCudaGraph, TestMHCWithOffloading) to fail with a TypeError. See inline comment for suggested fixes.

The tests in test_hyper_connection_recompute.py and test_mhc_block_manager.py construct modules directly and should be unaffected.

Address Claude review feedback on NVIDIA#4531: the new mHC test classes called get_gpt_layer_with_transformer_engine_spec(enable_hyper_connection=True), but the enable_hyper_connection kwarg on gpt_layer_specs.py is added in the deferred GPT-wiring follow-up split. As-is, the tests would have failed at construction with TypeError, and even past that would have tripped HyperConnectionTransformerLayer's IdentityOp assertion on the mHC submodules. Add a small _make_mhc_layer_spec helper that calls the unmodified get_gpt_layer_with_transformer_engine_spec() and patches in HyperConnectionModule for self_attention_hyper_connection and mlp_hyper_connection, so the tests are self-contained for this split. The five call sites in the new mHC test classes (TestTransformerLayerWithHyperConnectionRecompute, TestMHCRecomputeMemorySaving, TestMHCWithCudaGraph, TestMHCWithOffloading) are updated to use the helper.

Connor-XY · 2026-04-29T19:16:20Z

/claude review

claude

LGTM

Connor-XY · 2026-04-29T19:23:29Z

/ok to test 4a6e07c

Address Claude review feedback on NVIDIA#4531: the new mHC test classes called get_gpt_layer_with_transformer_engine_spec(enable_hyper_connection=True), but the enable_hyper_connection kwarg on gpt_layer_specs.py is added in the deferred GPT-wiring follow-up split. As-is, the tests would have failed at construction with TypeError, and even past that would have tripped HyperConnectionTransformerLayer's IdentityOp assertion on the mHC submodules. Add a small _make_mhc_layer_spec helper that calls the unmodified get_gpt_layer_with_transformer_engine_spec() and patches in HyperConnectionModule for self_attention_hyper_connection and mlp_hyper_connection, so the tests are self-contained for this split. The five call sites in the new mHC test classes (TestTransformerLayerWithHyperConnectionRecompute, TestMHCRecomputeMemorySaving, TestMHCWithCudaGraph, TestMHCWithOffloading) are updated to use the helper.

Connor-XY · 2026-04-29T19:33:13Z

/ok to test 6ee487e

Address Claude review feedback on NVIDIA#4531: the new mHC test classes called get_gpt_layer_with_transformer_engine_spec(enable_hyper_connection=True), but the enable_hyper_connection kwarg on gpt_layer_specs.py is added in the deferred GPT-wiring follow-up split. As-is, the tests would have failed at construction with TypeError, and even past that would have tripped HyperConnectionTransformerLayer's IdentityOp assertion on the mHC submodules. Add a small _make_mhc_layer_spec helper that calls the unmodified get_gpt_layer_with_transformer_engine_spec() and patches in HyperConnectionModule for self_attention_hyper_connection and mlp_hyper_connection, so the tests are self-contained for this split. The five call sites in the new mHC test classes (TestTransformerLayerWithHyperConnectionRecompute, TestMHCRecomputeMemorySaving, TestMHCWithCudaGraph, TestMHCWithOffloading) are updated to use the helper.

Connor-XY · 2026-04-29T20:47:14Z

/ok to test 0481591

… forward The Codecov patch-coverage gate on NVIDIA#4531 fails (68.25% < 80% target) because two new code paths in this split lack any test: * fused_bias_dropout._get_checkpointed_bda — only reachable via the mHC layer's recompute path, which the unit-test lane doesn't exercise. * transformer_block._build_mhc_recompute_layer_plan, _finalize_mhc_recompute_layer, and the input_expand / output_contract calls in TransformerBlock.forward — only reachable from a full block forward with mHC selective recompute enabled. Neither original test exists in PR NVIDIA#3430: * No test in the upstream PR references get_bias_dropout_add or _get_checkpointed_bda. * The block-level mHC forward test in NVIDIA#3430 (TestPPForwardWithMHC) is multi-GPU-only (torchrun --nproc-per-node=2+) and so wouldn't run in the codecov-feeding unit-test lane even if lifted. Add direct unit coverage: tests/unit_tests/fusions/test_bias_dropout_fusion.py TestBiasDropoutAddMhcRecompute test_checkpointed_bda_forward_backward (parametrized fused/with_bias) test_checkpointed_bda_chained_managers test_get_bda_without_manager_unchanged tests/unit_tests/transformer/test_mhc_block_manager.py TestTransformerBlockMHCRecompute test_recompute_plan_no_layer_num test_recompute_plan_with_layer_num test_recompute_plan_disabled test_block_forward_input_expand_output_contract Combined, these directly hit ~80 of the previously-missed lines in fused_bias_dropout.py (~14 of 19 missed) and transformer_block.py (~12 of 16 missed), which should bring patch coverage above the 80% threshold.

Connor-XY · 2026-04-30T01:28:55Z

/claude review

Connor-XY · 2026-04-30T01:29:07Z

/ok to test 54d47ce

claude

LGTM

Address Claude review feedback on NVIDIA#4531: the new mHC test classes called get_gpt_layer_with_transformer_engine_spec(enable_hyper_connection=True), but the enable_hyper_connection kwarg on gpt_layer_specs.py is added in the deferred GPT-wiring follow-up split. As-is, the tests would have failed at construction with TypeError, and even past that would have tripped HyperConnectionTransformerLayer's IdentityOp assertion on the mHC submodules. Add a small _make_mhc_layer_spec helper that calls the unmodified get_gpt_layer_with_transformer_engine_spec() and patches in HyperConnectionModule for self_attention_hyper_connection and mlp_hyper_connection, so the tests are self-contained for this split. The five call sites in the new mHC test classes (TestTransformerLayerWithHyperConnectionRecompute, TestMHCRecomputeMemorySaving, TestMHCWithCudaGraph, TestMHCWithOffloading) are updated to use the helper.

… forward The Codecov patch-coverage gate on NVIDIA#4531 fails (68.25% < 80% target) because two new code paths in this split lack any test: * fused_bias_dropout._get_checkpointed_bda — only reachable via the mHC layer's recompute path, which the unit-test lane doesn't exercise. * transformer_block._build_mhc_recompute_layer_plan, _finalize_mhc_recompute_layer, and the input_expand / output_contract calls in TransformerBlock.forward — only reachable from a full block forward with mHC selective recompute enabled. Neither original test exists in PR NVIDIA#3430: * No test in the upstream PR references get_bias_dropout_add or _get_checkpointed_bda. * The block-level mHC forward test in NVIDIA#3430 (TestPPForwardWithMHC) is multi-GPU-only (torchrun --nproc-per-node=2+) and so wouldn't run in the codecov-feeding unit-test lane even if lifted. Add direct unit coverage: tests/unit_tests/fusions/test_bias_dropout_fusion.py TestBiasDropoutAddMhcRecompute test_checkpointed_bda_forward_backward (parametrized fused/with_bias) test_checkpointed_bda_chained_managers test_get_bda_without_manager_unchanged tests/unit_tests/transformer/test_mhc_block_manager.py TestTransformerBlockMHCRecompute test_recompute_plan_no_layer_num test_recompute_plan_with_layer_num test_recompute_plan_disabled test_block_forward_input_expand_output_contract Combined, these directly hit ~80 of the previously-missed lines in fused_bias_dropout.py (~14 of 19 missed) and transformer_block.py (~12 of 16 missed), which should bring patch coverage above the 80% threshold.

Resolves all eleven comments on the PR thread: * Rename CheckpointManager → CheckpointWithoutOutputManager and update the docstring; the class strictly manages CheckpointWithoutOutput instances, so the new name avoids the broader "checkpoint" overloading. Updates all importers and tests. (NVIDIA#1) * Document why subtracting the per-row max in SinkhornKnopp.forward is benign — Sinkhorn's first row-normalization cancels any per-row scalar, so the shifted and unshifted exp produce the same fixed point and gradient. (NVIDIA#2) * Use NotImplementedError for the mhc + fine_grained_activation_offloading block — it's a known unimplemented interaction, not a config error. (NVIDIA#3) * Drop the new __call__ override and backward_dw_cudagraph from base TransformerLayer; the mHC kwarg extraction now lives on HyperConnectionTransformerLayer.__call__, with _mhc_recompute_manager initialized in __init__ so forward() reads it directly without a getattr fallback. cuda_graphs.py reads is_decode_only() directly, so dropping the dynamic_inference_decode_only injection is safe. (NVIDIA#4, NVIDIA#5, NVIDIA#10) * Rename the FineGrainedActivationOffloadingInterface alias off_interface → offload_interface in transformer_layer.py for clarity. (NVIDIA#6) * Extract a _run_mlp helper on TransformerLayer that owns the MLP-call branching (recompute / chunked-prefill / fp8-fp4 / plain-mlp); both base and HC _forward_mlp call it, eliminating the previous ~80-line duplication. The MoE-cudagraph early-return remains in base _forward_mlp after the helper call (HC is guarded against MoE). (NVIDIA#8) * Raise NotImplementedError at HyperConnectionTransformerLayer.__init__ when is_moe_layer is True and point users at HyperConnectionHybridLayer; drop the dead MoE branch in _get_submodules_under_cudagraphs. (NVIDIA#9) * No code change for the MoE composition / extensibility comment (NVIDIA#7) — see the PR thread reply for the rationale. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@mathemakitten

Mirrors the _forward_attention refactor for the MLP path, per @mathemakitten's review feedback on PR NVIDIA#4531. Base TransformerLayer changes: * Add _run_pre_mlp_layernorm: wraps _forward_pre_mlp_layernorm to handle tuple-output unpacking and the fp32_residual_connection cast, and returns (pre_mlp_layernorm_output, residual, mlp_state) where mlp_state is an opaque payload subclasses can use to thread extra intermediates to _apply_mlp_bda_step. Base returns (). * Rename _forward_post_mlp -> _apply_mlp_bda_step and add an mlp_state=() parameter (ignored by base). Updates the three internal callers (the base _forward_mlp skeleton, the cudagraph fallback path, and MoETransformerLayer._forward_mlp_postprocess). * Base _forward_mlp now reads as: _run_pre_mlp_layernorm -> _run_mlp -> MoE-cudagraph branch (unchanged) -> _apply_mlp_bda_step. The MoE early-return stays in base since HC guards against MoE submodules. HyperConnectionTransformerLayer changes: * Drop _forward_mlp (inherited from base). * Drop _forward_post_mlp_with_fused_hyper_connection. * Override _run_pre_mlp_layernorm: HC pre-wrap (mlp_hyper_connection), mHC-aware checkpoint condition, returns the n-stream residual and threads (mlp_h_res, mlp_hc_h_post) via mlp_state. * Override _apply_mlp_bda_step: unpacks the mlp_state slot, computes the mhc_mlp_bda_manager from self._mhc_recompute_manager (None on the last layer of a recompute block, since the block-end finalize hook handles its discard), runs fused_h_res_h_post_bda + viewless wrap. * Update HC.forward to drop the now-unneeded mhc_recompute_manager kwarg on the _forward_mlp call (the manager flows via self). * Extend the comment near self._mhc_recompute_manager to enumerate the four reader methods (_run_input_layernorm, _apply_self_attn_bda_step, _run_pre_mlp_layernorm, _apply_mlp_bda_step) so future maintainers don't try to "clean up" what looks like an unused attribute. Net effect: HC drops ~110 lines of duplicated MLP plumbing in exchange for two ~40-line override methods, and the contract for thread-state between hooks now matches attention exactly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Connor-XY · 2026-05-18T18:13:01Z

/ok to test 461af34

@jingqiny-99

… manifold hyper connection Adds the core Manifold Hyper Connection (mHC) module and the supporting transformer-block / transformer-layer / config / recompute changes. This is the first split of NVIDIA#3430 (mirror of upstream NVIDIA#2943 + kernel-fusion NVIDIA#3828), covering only files owned by core-adlr / core-nemo / transformer / cuda-graphs reviewers. GPT model wiring, pipeline-parallel support, fused mHC cuTile kernels, and the functional-test recipe are deferred to follow-up split PRs. Original work by @jingqiny-99 in NVIDIA#3430 (upstream NVIDIA#2943, basic pytorch impl only — kernel fusion NVIDIA#3828 deferred). Co-authored-by: jingqiny-99 <jingqiny@nvidia.com> Co-authored-by: Dennis Liu <denliu@nvidia.com> Signed-off-by: Yan Xu <yxu1@nvidia.com>

Address Claude review feedback on NVIDIA#4531: the new mHC test classes called get_gpt_layer_with_transformer_engine_spec(enable_hyper_connection=True), but the enable_hyper_connection kwarg on gpt_layer_specs.py is added in the deferred GPT-wiring follow-up split. As-is, the tests would have failed at construction with TypeError, and even past that would have tripped HyperConnectionTransformerLayer's IdentityOp assertion on the mHC submodules. Add a small _make_mhc_layer_spec helper that calls the unmodified get_gpt_layer_with_transformer_engine_spec() and patches in HyperConnectionModule for self_attention_hyper_connection and mlp_hyper_connection, so the tests are self-contained for this split. The five call sites in the new mHC test classes (TestTransformerLayerWithHyperConnectionRecompute, TestMHCRecomputeMemorySaving, TestMHCWithCudaGraph, TestMHCWithOffloading) are updated to use the helper. Signed-off-by: Yan Xu <yxu1@nvidia.com>

CI run #25129636936 (job tests/unit_tests/models/**/*.py) failed with "Mamba MoE config drift detected!" because PR NVIDIA#1 adds five new TransformerConfig fields (enable_hyper_connections, mhc_init_gating_factor, mhc_recompute_layer_num, mhc_sinkhorn_iterations, num_residual_streams) but this PR previously deferred the corresponding test_mamba_moe_model.py GOLDEN_CONFIG update (the file has since been renamed to test_hybrid_moe_model.py on main). This is a generic config-drift sentinel — any PR that adds TransformerConfig fields must update this golden snapshot — so it belongs in this split, not in a deferred follow-up. Add the five keys with the same default values used in the original PR NVIDIA#3430 / upstream NVIDIA#2943. Signed-off-by: Yan Xu <yxu1@nvidia.com>

… forward The Codecov patch-coverage gate on NVIDIA#4531 fails (68.25% < 80% target) because two new code paths in this split lack any test: * fused_bias_dropout._get_checkpointed_bda — only reachable via the mHC layer's recompute path, which the unit-test lane doesn't exercise. * transformer_block._build_mhc_recompute_layer_plan, _finalize_mhc_recompute_layer, and the input_expand / output_contract calls in TransformerBlock.forward — only reachable from a full block forward with mHC selective recompute enabled. Neither original test exists in PR NVIDIA#3430: * No test in the upstream PR references get_bias_dropout_add or _get_checkpointed_bda. * The block-level mHC forward test in NVIDIA#3430 (TestPPForwardWithMHC) is multi-GPU-only (torchrun --nproc-per-node=2+) and so wouldn't run in the codecov-feeding unit-test lane even if lifted. Add direct unit coverage: tests/unit_tests/fusions/test_bias_dropout_fusion.py TestBiasDropoutAddMhcRecompute test_checkpointed_bda_forward_backward (parametrized fused/with_bias) test_checkpointed_bda_chained_managers test_get_bda_without_manager_unchanged tests/unit_tests/transformer/test_mhc_block_manager.py TestTransformerBlockMHCRecompute test_recompute_plan_no_layer_num test_recompute_plan_with_layer_num test_recompute_plan_disabled test_block_forward_input_expand_output_contract Combined, these directly hit ~80 of the previously-missed lines in fused_bias_dropout.py (~14 of 19 missed) and transformer_block.py (~12 of 16 missed), which should bring patch coverage above the 80% threshold. Signed-off-by: Yan Xu <yxu1@nvidia.com>

Lint failure in CI run #25142652872 (job 73695715090): black wanted the attention_mask construction in test_block_forward_input_expand_output_contract on a single line — at 91 chars it fits the project's line_length=100 limit. Signed-off-by: Yan Xu <yxu1@nvidia.com>

Resolves all eleven comments on the PR thread: * Rename CheckpointManager → CheckpointWithoutOutputManager and update the docstring; the class strictly manages CheckpointWithoutOutput instances, so the new name avoids the broader "checkpoint" overloading. Updates all importers and tests. (NVIDIA#1) * Document why subtracting the per-row max in SinkhornKnopp.forward is benign — Sinkhorn's first row-normalization cancels any per-row scalar, so the shifted and unshifted exp produce the same fixed point and gradient. (NVIDIA#2) * Use NotImplementedError for the mhc + fine_grained_activation_offloading block — it's a known unimplemented interaction, not a config error. (NVIDIA#3) * Drop the new __call__ override and backward_dw_cudagraph from base TransformerLayer; the mHC kwarg extraction now lives on HyperConnectionTransformerLayer.__call__, with _mhc_recompute_manager initialized in __init__ so forward() reads it directly without a getattr fallback. cuda_graphs.py reads is_decode_only() directly, so dropping the dynamic_inference_decode_only injection is safe. (NVIDIA#4, NVIDIA#5, NVIDIA#10) * Rename the FineGrainedActivationOffloadingInterface alias off_interface → offload_interface in transformer_layer.py for clarity. (NVIDIA#6) * Extract a _run_mlp helper on TransformerLayer that owns the MLP-call branching (recompute / chunked-prefill / fp8-fp4 / plain-mlp); both base and HC _forward_mlp call it, eliminating the previous ~80-line duplication. The MoE-cudagraph early-return remains in base _forward_mlp after the helper call (HC is guarded against MoE). (NVIDIA#8) * Raise NotImplementedError at HyperConnectionTransformerLayer.__init__ when is_moe_layer is True and point users at HyperConnectionHybridLayer; drop the dead MoE branch in _get_submodules_under_cudagraphs. (NVIDIA#9) * No code change for the MoE composition / extensibility comment (NVIDIA#7) — see the PR thread reply for the rationale. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yan Xu <yxu1@nvidia.com>

The pop was defensive cleanup against an older signature. No caller in the repo passes is_last_layer_in_recompute_block as a kwarg today — TransformerBlock sets it as an attribute on the manager object before each layer call (see transformer_block.py around line 886), and the forward path reads it via getattr on the manager. Removing the dead pop per follow-up review feedback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yan Xu <yxu1@nvidia.com>

Resolves the latent duplication between TransformerLayer._forward_attention and HyperConnectionTransformerLayer._forward_attention by introducing three override hooks on base TransformerLayer: * _run_input_layernorm — input layernorm with optional output-discarding checkpoint and fine-grained activation offloading. Sets self._input_layernorm_checkpoint_active so the caller can gate the post-self-attn discard hook on the same condition. * _apply_self_attn_bda_step — bias-dropout-add for the self-attention output plus the post-step offload commit. Subclasses override to swap in a fused kernel. * _run_cross_attention — full cross-attention block (layernorm, cross-attention, bda). Base _forward_attention collapses to a 25-line skeleton that calls these hooks. HyperConnectionTransformerLayer drops its own _forward_attention entirely and overrides only _run_input_layernorm (HC pre-wrap + mHC-aware checkpoint, stashes h_res/h_post on self) and _apply_self_attn_bda_step (consumes the stashed h_res/h_post in the fused_h_res_h_post_bda kernel; clears them on exit). Inheriting _run_cross_attention also fixes a latent gap: HC's previous cross-attention path silently skipped tuple-output pre_cross_attn_layernorm unpacking and the fp32_residual_connection cast on the cross-attn residual. In current usage these knobs are not exercised by any mHC config (cross-attention is typically IdentityOp), but inheriting base makes HC correct under those configurations. Verified on hsg with the four mHC-relevant unit-test files: 87 passed, the only 4 failures are pre-existing test_sharded_state_dict[tp_pp0/2] cases that need world_size=8 (environmental, not regressions). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yan Xu <yxu1@nvidia.com>

…attrs Replaces the instance-attribute pattern introduced in 42921d7 (self._h_res, self._h_post stashed in HC._run_input_layernorm and consumed in HC._apply_self_attn_bda_step) with explicit parameter passing through an opaque attn_state slot in the base hook contract. * TransformerLayer._run_input_layernorm now returns (input_layernorm_output, residual, attn_state) where attn_state is () for base — an opaque payload subclasses can use to thread extra intermediates to _apply_self_attn_bda_step. * TransformerLayer._apply_self_attn_bda_step gains an attn_state=() parameter; base ignores it. * TransformerLayer._forward_attention skeleton unpacks the 3-tuple and forwards attn_state to the bda step. * HyperConnectionTransformerLayer._run_input_layernorm returns (input_layernorm_output, residual, (h_res, h_post)). * HyperConnectionTransformerLayer._apply_self_attn_bda_step unpacks h_res, h_post = attn_state; the manual self._h_res/self._h_post clear is gone. * HyperConnectionTransformerLayer.__init__ drops the self._h_res/self._h_post attribute initialization. Net effect: no implicit data flow between methods (h_res/h_post are now visibly parameters), no manual lifetime management, two fewer instance attributes on every HC instance. Verified on hsg with the mHC test suite: 87 passed / 4 failed (identical baseline — the 4 failures are pre-existing test_sharded_state_dict[tp_pp0/2] cases that need world_size=8). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yan Xu <yxu1@nvidia.com>

Two comment lines above the _set_proj_residual call were dropped during the _forward_attention skeleton refactor in 42921d7. They explained that the call sets up the residual specifically for a fused reduce-scatter + add + layer-norm + all-gather kernel in attention's out_proj — non-obvious from the function name alone, so worth preserving. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yan Xu <yxu1@nvidia.com>

@mathemakitten

Mirrors the _forward_attention refactor for the MLP path, per @mathemakitten's review feedback on PR NVIDIA#4531. Base TransformerLayer changes: * Add _run_pre_mlp_layernorm: wraps _forward_pre_mlp_layernorm to handle tuple-output unpacking and the fp32_residual_connection cast, and returns (pre_mlp_layernorm_output, residual, mlp_state) where mlp_state is an opaque payload subclasses can use to thread extra intermediates to _apply_mlp_bda_step. Base returns (). * Rename _forward_post_mlp -> _apply_mlp_bda_step and add an mlp_state=() parameter (ignored by base). Updates the three internal callers (the base _forward_mlp skeleton, the cudagraph fallback path, and MoETransformerLayer._forward_mlp_postprocess). * Base _forward_mlp now reads as: _run_pre_mlp_layernorm -> _run_mlp -> MoE-cudagraph branch (unchanged) -> _apply_mlp_bda_step. The MoE early-return stays in base since HC guards against MoE submodules. HyperConnectionTransformerLayer changes: * Drop _forward_mlp (inherited from base). * Drop _forward_post_mlp_with_fused_hyper_connection. * Override _run_pre_mlp_layernorm: HC pre-wrap (mlp_hyper_connection), mHC-aware checkpoint condition, returns the n-stream residual and threads (mlp_h_res, mlp_hc_h_post) via mlp_state. * Override _apply_mlp_bda_step: unpacks the mlp_state slot, computes the mhc_mlp_bda_manager from self._mhc_recompute_manager (None on the last layer of a recompute block, since the block-end finalize hook handles its discard), runs fused_h_res_h_post_bda + viewless wrap. * Update HC.forward to drop the now-unneeded mhc_recompute_manager kwarg on the _forward_mlp call (the manager flows via self). * Extend the comment near self._mhc_recompute_manager to enumerate the four reader methods (_run_input_layernorm, _apply_self_attn_bda_step, _run_pre_mlp_layernorm, _apply_mlp_bda_step) so future maintainers don't try to "clean up" what looks like an unused attribute. Net effect: HC drops ~110 lines of duplicated MLP plumbing in exchange for two ~40-line override methods, and the contract for thread-state between hooks now matches attention exactly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yan Xu <yxu1@nvidia.com>

The CheckpointManager → CheckpointWithoutOutputManager rename pushed the docstring for HyperConnectionModule.forward.mhc_recompute_manager from just under to one char over the 100-char pylint limit. Shorten "checkpoint management." → "checkpoint mgmt." to bring it back under. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yan Xu <yxu1@nvidia.com>

Plain TransformerLayer (and its MoETransformerLayer subclass) doesn't accept the mhc_recompute_manager kwarg — only HyperConnectionTransformerLayer's __call__ pops it. Unconditionally threading it through layer(...) caused TypeError on every non-mHC GPTModel test. Gate the kwarg on `mhc_manager is not None` so it's only passed when the block actually constructed a manager (i.e. mHC is enabled). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yan Xu <yxu1@nvidia.com>

The MLP-hook extraction in this PR renamed TransformerLayer._forward_post_mlp -> _apply_mlp_bda_step and added an `mlp_state` parameter. The rename silently breaks downstream subclasses that still override the legacy name (e.g. Gemma4TransformerLayer in Megatron-Bridge): the orphaned override is never called, the inherited base implementation runs, and the subclass's post-MLP math (post-FFN layernorm, layer_scalar, etc.) is dropped without any error. Add a one-shot shim at the top of `_apply_mlp_bda_step` that walks the MRO from the concrete class up to TransformerLayer; if any subclass defined the legacy `_forward_post_mlp` in its own __dict__, route through it and emit a DeprecationWarning. The `mlp_state` slot is dropped when delegating, matching the legacy 2-arg contract. To be removed in a future release once known downstream subclasses have migrated to `_apply_mlp_bda_step`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yan Xu <yxu1@nvidia.com>

Connor-XY · 2026-06-12T17:51:01Z

/ok to test 27fef06

Connor-XY added the Run functional tests label Apr 29, 2026

claude Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread tests/unit_tests/transformer/test_transformer_layer.py

claude Bot reviewed Apr 29, 2026

View reviewed changes

claude Bot approved these changes Apr 29, 2026

View reviewed changes

copy-pr-bot Bot had a problem deploying to test April 29, 2026 19:24 Error

Connor-XY force-pushed the mhc-pr1-core branch from 4a6e07c to 6ee487e Compare April 29, 2026 19:32

Connor-XY marked this pull request as ready for review April 29, 2026 19:33

Connor-XY requested review from a team as code owners April 29, 2026 19:33

svcnvidia-nemo-ci requested a review from a team April 29, 2026 19:33

copy-pr-bot Bot temporarily deployed to test April 29, 2026 19:34 Inactive

svcnvidia-nemo-ci added the complexity: high label Apr 29, 2026

Connor-XY force-pushed the mhc-pr1-core branch from d743722 to 0481591 Compare April 29, 2026 20:46

copy-pr-bot Bot temporarily deployed to test April 29, 2026 20:48 Inactive

Connor-XY requested a review from Phlip79 April 29, 2026 20:48

claude Bot approved these changes Apr 30, 2026

View reviewed changes

Connor-XY force-pushed the mhc-pr1-core branch from 20dbf8e to 461af34 Compare May 18, 2026 17:21

copy-pr-bot Bot temporarily deployed to public May 18, 2026 18:13 Inactive

copy-pr-bot Bot temporarily deployed to test May 18, 2026 18:14 Inactive

copy-pr-bot Bot temporarily deployed to public May 18, 2026 18:17 Inactive

ericharper approved these changes May 20, 2026

View reviewed changes

Connor-XY and others added 14 commits June 12, 2026 10:45

Conversation

Connor-XY commented Apr 29, 2026

What does this PR do?

Deferred to follow-up split PRs (NOT in this PR)

Unit-test coverage for components in this PR

Credit

Pre-checks

Uh oh!

copy-pr-bot Bot commented Apr 29, 2026

Uh oh!

Connor-XY commented Apr 29, 2026

Uh oh!

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Review — Split 1/N mHC core module

Uh oh!

Connor-XY commented Apr 29, 2026

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Connor-XY commented Apr 29, 2026

Uh oh!

Connor-XY commented Apr 29, 2026

Uh oh!

Connor-XY commented Apr 29, 2026

Uh oh!

Connor-XY commented Apr 30, 2026

Uh oh!

Connor-XY commented Apr 30, 2026

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Connor-XY commented May 18, 2026

Uh oh!

Connor-XY commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants