Skip to content

[Split 1/N of #3430] feat(mHC): basic pytorch implementation of manifold hyper connection#4531

Open
Connor-XY wants to merge 14 commits into
NVIDIA:mainfrom
Connor-XY:mhc-pr1-core
Open

[Split 1/N of #3430] feat(mHC): basic pytorch implementation of manifold hyper connection#4531
Connor-XY wants to merge 14 commits into
NVIDIA:mainfrom
Connor-XY:mhc-pr1-core

Conversation

@Connor-XY

Copy link
Copy Markdown
Contributor

What does this PR do?

This is the first of a sequence of split PRs derived from #3430
(mirror of upstream #2943 + kernel-fusion #3828). The goal of splitting is
to keep each PR small and limit the number of CODEOWNERS reviewer groups
required, making review faster and easier. Subsequent splits will be
opened only after this one merges — strictly one open at a time.

This PR contains only the core Manifold Hyper Connection (mHC) module
and its directly-supporting changes:

  • megatron/core/transformer/hyper_connection.py (new mHC module)
  • megatron/core/transformer/transformer_block.py (mHC integration into the block)
  • megatron/core/transformer/transformer_layer.py (HyperConnectionTransformerLayer + recompute hooks)
  • megatron/core/transformer/transformer_config.py (new mHC config fields)
  • megatron/core/transformer/__init__.py (re-exports of mHC symbols)
  • megatron/core/transformer/cuda_graphs.py (copyright-year bump only)
  • megatron/core/tensor_parallel/random.py (CheckpointWithoutOutput / CheckpointManager refactor used by mHC recompute)
  • megatron/core/fusions/fused_bias_dropout.py (extends get_bias_dropout_add with an optional mhc_recompute_manager path)

Reviewer groups touched: core-adlr, core-nemo, transformer, cuda-graphs.

Deferred to follow-up split PRs (NOT in this PR)

  • GPT model wiring — gpt_layer_specs.py, experimental specs, gpt_builders.py, training/initialize.py, model/fp8 tests
  • Pipeline-parallel support — pipeline_parallel/schedules.py, test_pp_mhc_compatibility.py
  • Fused mHC cuTile kernels — the #3828 kernel side (fused_mhc_kernels.py, kernel-related fields in transformer_config.py, kernel-using paths in hyper_connection.py, test_fused_mhc_kernels.py)
  • Functional-test recipe — tests/functional_tests/test_cases/gpt/gpt3_mcore_te_tp2_pp2_mhc/*, tests/test_utils/recipes/h100/gpt.yaml

Unit-test coverage for components in this PR

Every component added or modified by this PR has a corresponding unit test
included in this same PR:

Component Unit test (in this PR)
hyper_connection.py (new module) tests/unit_tests/transformer/test_hyper_connection_recompute.py, tests/unit_tests/transformer/test_mhc_block_manager.py, tests/unit_tests/transformer/test_transformer_layer.py (HyperConnectionTransformerLayer cases)
transformer_layer.py (mHC variant + recompute hooks) tests/unit_tests/transformer/test_transformer_layer.py
transformer_block.py (mHC integration) tests/unit_tests/transformer/test_mhc_block_manager.py, tests/unit_tests/transformer/test_transformer_layer.py
transformer_config.py (new mHC fields) exercised by configs in tests/unit_tests/transformer/test_transformer_layer.py
tensor_parallel/random.py (CheckpointWithoutOutput + ckpt_manager) tests/unit_tests/transformer/test_hyper_connection_recompute.py
fusions/fused_bias_dropout.py (mhc_recompute_manager path) exercised indirectly via the mHC MLP/recompute path in tests/unit_tests/transformer/test_hyper_connection_recompute.py and tests/unit_tests/transformer/test_transformer_layer.py
transformer/__init__.py export-only (no dedicated test — symbols are imported and exercised by the tests above)
transformer/cuda_graphs.py copyright-year bump only (no test required)

The functional test (golden-values + model_config) for the end-to-end mHC
configuration lives in the deferred functional-test split PR, since that
file is @NVIDIA/ci-owned and orthogonal to the transformer reviewers.

Credit

All changes here are a re-slice of @jingqiny-99's work in #3430
(upstream basic pytorch implementation #2943). No new functionality has
been added, removed, or rewritten in this split — only repackaged for
narrower review scope.

Pre-checks

  • Re-slice of an existing PR; no new feature
  • Original unit tests included for all components touched
  • Functional test deferred to a later split PR (per split plan)
  • PR opened as draft per repo contributing guidelines

@copy-pr-bot

copy-pr-bot Bot commented Apr 29, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Connor-XY

Copy link
Copy Markdown
Contributor Author

/claude review

Comment thread tests/unit_tests/transformer/test_transformer_layer.py

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — Split 1/N mHC core module

The core mHC implementation, CheckpointManager, _save_args_to_ctx/_load_args_from_ctx refactor, and the HyperConnectionTransformerLayer all look correct. The recompute plan in TransformerBlock, Sinkhorn-Knopp backward via recomputation, and the unified gradient hook design are clean.

One critical issue: all new test classes in test_transformer_layer.py call get_gpt_layer_with_transformer_engine_spec(enable_hyper_connection=True), but gpt_layer_specs.py is not modified in this PR and does not accept that parameter. This will cause every new test class (TestTransformerLayerWithHyperConnectionRecompute, TestMHCRecomputeMemorySaving, TestMHCWithCudaGraph, TestMHCWithOffloading) to fail with a TypeError. See inline comment for suggested fixes.

The tests in test_hyper_connection_recompute.py and test_mhc_block_manager.py construct modules directly and should be unaffected.

Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request Apr 29, 2026


Address Claude review feedback on NVIDIA#4531: the new mHC test classes called
get_gpt_layer_with_transformer_engine_spec(enable_hyper_connection=True),
but the enable_hyper_connection kwarg on gpt_layer_specs.py is added in
the deferred GPT-wiring follow-up split. As-is, the tests would have
failed at construction with TypeError, and even past that would have
tripped HyperConnectionTransformerLayer's IdentityOp assertion on the
mHC submodules.

Add a small _make_mhc_layer_spec helper that calls the unmodified
get_gpt_layer_with_transformer_engine_spec() and patches in
HyperConnectionModule for self_attention_hyper_connection and
mlp_hyper_connection, so the tests are self-contained for this split.
The five call sites in the new mHC test classes
(TestTransformerLayerWithHyperConnectionRecompute,
TestMHCRecomputeMemorySaving, TestMHCWithCudaGraph, TestMHCWithOffloading)
are updated to use the helper.
@Connor-XY

Copy link
Copy Markdown
Contributor Author

/claude review

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Connor-XY

Copy link
Copy Markdown
Contributor Author

/ok to test 4a6e07c

Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request Apr 29, 2026


Address Claude review feedback on NVIDIA#4531: the new mHC test classes called
get_gpt_layer_with_transformer_engine_spec(enable_hyper_connection=True),
but the enable_hyper_connection kwarg on gpt_layer_specs.py is added in
the deferred GPT-wiring follow-up split. As-is, the tests would have
failed at construction with TypeError, and even past that would have
tripped HyperConnectionTransformerLayer's IdentityOp assertion on the
mHC submodules.

Add a small _make_mhc_layer_spec helper that calls the unmodified
get_gpt_layer_with_transformer_engine_spec() and patches in
HyperConnectionModule for self_attention_hyper_connection and
mlp_hyper_connection, so the tests are self-contained for this split.
The five call sites in the new mHC test classes
(TestTransformerLayerWithHyperConnectionRecompute,
TestMHCRecomputeMemorySaving, TestMHCWithCudaGraph, TestMHCWithOffloading)
are updated to use the helper.
@Connor-XY

Copy link
Copy Markdown
Contributor Author

/ok to test 6ee487e

@Connor-XY Connor-XY marked this pull request as ready for review April 29, 2026 19:33
@Connor-XY Connor-XY requested review from a team as code owners April 29, 2026 19:33
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team April 29, 2026 19:33
Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request Apr 29, 2026


Address Claude review feedback on NVIDIA#4531: the new mHC test classes called
get_gpt_layer_with_transformer_engine_spec(enable_hyper_connection=True),
but the enable_hyper_connection kwarg on gpt_layer_specs.py is added in
the deferred GPT-wiring follow-up split. As-is, the tests would have
failed at construction with TypeError, and even past that would have
tripped HyperConnectionTransformerLayer's IdentityOp assertion on the
mHC submodules.

Add a small _make_mhc_layer_spec helper that calls the unmodified
get_gpt_layer_with_transformer_engine_spec() and patches in
HyperConnectionModule for self_attention_hyper_connection and
mlp_hyper_connection, so the tests are self-contained for this split.
The five call sites in the new mHC test classes
(TestTransformerLayerWithHyperConnectionRecompute,
TestMHCRecomputeMemorySaving, TestMHCWithCudaGraph, TestMHCWithOffloading)
are updated to use the helper.
@Connor-XY

Copy link
Copy Markdown
Contributor Author

/ok to test 0481591

@Connor-XY Connor-XY requested a review from Phlip79 April 29, 2026 20:48
Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request Apr 30, 2026
… forward

The Codecov patch-coverage gate on NVIDIA#4531 fails (68.25% < 80% target) because
two new code paths in this split lack any test:

  * fused_bias_dropout._get_checkpointed_bda — only reachable via the
    mHC layer's recompute path, which the unit-test lane doesn't exercise.
  * transformer_block._build_mhc_recompute_layer_plan,
    _finalize_mhc_recompute_layer, and the input_expand / output_contract
    calls in TransformerBlock.forward — only reachable from a full block
    forward with mHC selective recompute enabled.

Neither original test exists in PR NVIDIA#3430:
  * No test in the upstream PR references get_bias_dropout_add or
    _get_checkpointed_bda.
  * The block-level mHC forward test in NVIDIA#3430 (TestPPForwardWithMHC) is
    multi-GPU-only (torchrun --nproc-per-node=2+) and so wouldn't run
    in the codecov-feeding unit-test lane even if lifted.

Add direct unit coverage:

  tests/unit_tests/fusions/test_bias_dropout_fusion.py
    TestBiasDropoutAddMhcRecompute
      test_checkpointed_bda_forward_backward (parametrized fused/with_bias)
      test_checkpointed_bda_chained_managers
      test_get_bda_without_manager_unchanged

  tests/unit_tests/transformer/test_mhc_block_manager.py
    TestTransformerBlockMHCRecompute
      test_recompute_plan_no_layer_num
      test_recompute_plan_with_layer_num
      test_recompute_plan_disabled
      test_block_forward_input_expand_output_contract

Combined, these directly hit ~80 of the previously-missed lines in
fused_bias_dropout.py (~14 of 19 missed) and transformer_block.py
(~12 of 16 missed), which should bring patch coverage above the 80%
threshold.
@Connor-XY

Copy link
Copy Markdown
Contributor Author

/claude review

@Connor-XY

Copy link
Copy Markdown
Contributor Author

/ok to test 54d47ce

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request May 18, 2026


Address Claude review feedback on NVIDIA#4531: the new mHC test classes called
get_gpt_layer_with_transformer_engine_spec(enable_hyper_connection=True),
but the enable_hyper_connection kwarg on gpt_layer_specs.py is added in
the deferred GPT-wiring follow-up split. As-is, the tests would have
failed at construction with TypeError, and even past that would have
tripped HyperConnectionTransformerLayer's IdentityOp assertion on the
mHC submodules.

Add a small _make_mhc_layer_spec helper that calls the unmodified
get_gpt_layer_with_transformer_engine_spec() and patches in
HyperConnectionModule for self_attention_hyper_connection and
mlp_hyper_connection, so the tests are self-contained for this split.
The five call sites in the new mHC test classes
(TestTransformerLayerWithHyperConnectionRecompute,
TestMHCRecomputeMemorySaving, TestMHCWithCudaGraph, TestMHCWithOffloading)
are updated to use the helper.
Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request May 18, 2026
… forward

The Codecov patch-coverage gate on NVIDIA#4531 fails (68.25% < 80% target) because
two new code paths in this split lack any test:

  * fused_bias_dropout._get_checkpointed_bda — only reachable via the
    mHC layer's recompute path, which the unit-test lane doesn't exercise.
  * transformer_block._build_mhc_recompute_layer_plan,
    _finalize_mhc_recompute_layer, and the input_expand / output_contract
    calls in TransformerBlock.forward — only reachable from a full block
    forward with mHC selective recompute enabled.

Neither original test exists in PR NVIDIA#3430:
  * No test in the upstream PR references get_bias_dropout_add or
    _get_checkpointed_bda.
  * The block-level mHC forward test in NVIDIA#3430 (TestPPForwardWithMHC) is
    multi-GPU-only (torchrun --nproc-per-node=2+) and so wouldn't run
    in the codecov-feeding unit-test lane even if lifted.

Add direct unit coverage:

  tests/unit_tests/fusions/test_bias_dropout_fusion.py
    TestBiasDropoutAddMhcRecompute
      test_checkpointed_bda_forward_backward (parametrized fused/with_bias)
      test_checkpointed_bda_chained_managers
      test_get_bda_without_manager_unchanged

  tests/unit_tests/transformer/test_mhc_block_manager.py
    TestTransformerBlockMHCRecompute
      test_recompute_plan_no_layer_num
      test_recompute_plan_with_layer_num
      test_recompute_plan_disabled
      test_block_forward_input_expand_output_contract

Combined, these directly hit ~80 of the previously-missed lines in
fused_bias_dropout.py (~14 of 19 missed) and transformer_block.py
(~12 of 16 missed), which should bring patch coverage above the 80%
threshold.
Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request May 18, 2026
Resolves all eleven comments on the PR thread:

* Rename CheckpointManager → CheckpointWithoutOutputManager and update
  the docstring; the class strictly manages CheckpointWithoutOutput
  instances, so the new name avoids the broader "checkpoint" overloading.
  Updates all importers and tests. (NVIDIA#1)
* Document why subtracting the per-row max in SinkhornKnopp.forward is
  benign — Sinkhorn's first row-normalization cancels any per-row
  scalar, so the shifted and unshifted exp produce the same fixed point
  and gradient. (NVIDIA#2)
* Use NotImplementedError for the mhc + fine_grained_activation_offloading
  block — it's a known unimplemented interaction, not a config error. (NVIDIA#3)
* Drop the new __call__ override and backward_dw_cudagraph from base
  TransformerLayer; the mHC kwarg extraction now lives on
  HyperConnectionTransformerLayer.__call__, with _mhc_recompute_manager
  initialized in __init__ so forward() reads it directly without a
  getattr fallback. cuda_graphs.py reads is_decode_only() directly,
  so dropping the dynamic_inference_decode_only injection is safe. (NVIDIA#4, NVIDIA#5, NVIDIA#10)
* Rename the FineGrainedActivationOffloadingInterface alias
  off_interface → offload_interface in transformer_layer.py for clarity. (NVIDIA#6)
* Extract a _run_mlp helper on TransformerLayer that owns the MLP-call
  branching (recompute / chunked-prefill / fp8-fp4 / plain-mlp); both
  base and HC _forward_mlp call it, eliminating the previous
  ~80-line duplication. The MoE-cudagraph early-return remains in base
  _forward_mlp after the helper call (HC is guarded against MoE). (NVIDIA#8)
* Raise NotImplementedError at HyperConnectionTransformerLayer.__init__
  when is_moe_layer is True and point users at HyperConnectionHybridLayer;
  drop the dead MoE branch in _get_submodules_under_cudagraphs. (NVIDIA#9)
* No code change for the MoE composition / extensibility comment (NVIDIA#7) —
  see the PR thread reply for the rationale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request May 18, 2026
Mirrors the _forward_attention refactor for the MLP path, per
@mathemakitten's review feedback on PR NVIDIA#4531.

Base TransformerLayer changes:
* Add _run_pre_mlp_layernorm: wraps _forward_pre_mlp_layernorm to
  handle tuple-output unpacking and the fp32_residual_connection cast,
  and returns (pre_mlp_layernorm_output, residual, mlp_state) where
  mlp_state is an opaque payload subclasses can use to thread extra
  intermediates to _apply_mlp_bda_step. Base returns ().
* Rename _forward_post_mlp -> _apply_mlp_bda_step and add an mlp_state=()
  parameter (ignored by base). Updates the three internal callers (the
  base _forward_mlp skeleton, the cudagraph fallback path, and
  MoETransformerLayer._forward_mlp_postprocess).
* Base _forward_mlp now reads as: _run_pre_mlp_layernorm -> _run_mlp ->
  MoE-cudagraph branch (unchanged) -> _apply_mlp_bda_step. The MoE
  early-return stays in base since HC guards against MoE submodules.

HyperConnectionTransformerLayer changes:
* Drop _forward_mlp (inherited from base).
* Drop _forward_post_mlp_with_fused_hyper_connection.
* Override _run_pre_mlp_layernorm: HC pre-wrap (mlp_hyper_connection),
  mHC-aware checkpoint condition, returns the n-stream residual and
  threads (mlp_h_res, mlp_hc_h_post) via mlp_state.
* Override _apply_mlp_bda_step: unpacks the mlp_state slot, computes the
  mhc_mlp_bda_manager from self._mhc_recompute_manager (None on the
  last layer of a recompute block, since the block-end finalize hook
  handles its discard), runs fused_h_res_h_post_bda + viewless wrap.
* Update HC.forward to drop the now-unneeded mhc_recompute_manager
  kwarg on the _forward_mlp call (the manager flows via self).
* Extend the comment near self._mhc_recompute_manager to enumerate the
  four reader methods (_run_input_layernorm, _apply_self_attn_bda_step,
  _run_pre_mlp_layernorm, _apply_mlp_bda_step) so future maintainers
  don't try to "clean up" what looks like an unused attribute.

Net effect: HC drops ~110 lines of duplicated MLP plumbing in exchange
for two ~40-line override methods, and the contract for thread-state
between hooks now matches attention exactly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Connor-XY

Copy link
Copy Markdown
Contributor Author

/ok to test 461af34

Connor-XY and others added 14 commits June 12, 2026 10:45
… manifold hyper connection

Adds the core Manifold Hyper Connection (mHC) module and the supporting
transformer-block / transformer-layer / config / recompute changes. This
is the first split of NVIDIA#3430 (mirror of upstream NVIDIA#2943 + kernel-fusion NVIDIA#3828),
covering only files owned by core-adlr / core-nemo / transformer / cuda-graphs
reviewers. GPT model wiring, pipeline-parallel support, fused mHC cuTile
kernels, and the functional-test recipe are deferred to follow-up split PRs.

Original work by @jingqiny-99 in NVIDIA#3430
(upstream NVIDIA#2943, basic pytorch impl only — kernel fusion NVIDIA#3828 deferred).

Co-authored-by: jingqiny-99 <jingqiny@nvidia.com>
Co-authored-by: Dennis Liu <denliu@nvidia.com>
Signed-off-by: Yan Xu <yxu1@nvidia.com>


Address Claude review feedback on NVIDIA#4531: the new mHC test classes called
get_gpt_layer_with_transformer_engine_spec(enable_hyper_connection=True),
but the enable_hyper_connection kwarg on gpt_layer_specs.py is added in
the deferred GPT-wiring follow-up split. As-is, the tests would have
failed at construction with TypeError, and even past that would have
tripped HyperConnectionTransformerLayer's IdentityOp assertion on the
mHC submodules.

Add a small _make_mhc_layer_spec helper that calls the unmodified
get_gpt_layer_with_transformer_engine_spec() and patches in
HyperConnectionModule for self_attention_hyper_connection and
mlp_hyper_connection, so the tests are self-contained for this split.
The five call sites in the new mHC test classes
(TestTransformerLayerWithHyperConnectionRecompute,
TestMHCRecomputeMemorySaving, TestMHCWithCudaGraph, TestMHCWithOffloading)
are updated to use the helper.

Signed-off-by: Yan Xu <yxu1@nvidia.com>
CI run #25129636936 (job tests/unit_tests/models/**/*.py) failed with
"Mamba MoE config drift detected!" because PR NVIDIA#1 adds five new
TransformerConfig fields (enable_hyper_connections, mhc_init_gating_factor,
mhc_recompute_layer_num, mhc_sinkhorn_iterations, num_residual_streams) but
this PR previously deferred the corresponding test_mamba_moe_model.py
GOLDEN_CONFIG update (the file has since been renamed to
test_hybrid_moe_model.py on main).

This is a generic config-drift sentinel — any PR that adds TransformerConfig
fields must update this golden snapshot — so it belongs in this split, not
in a deferred follow-up. Add the five keys with the same default values used
in the original PR NVIDIA#3430 / upstream NVIDIA#2943.

Signed-off-by: Yan Xu <yxu1@nvidia.com>
… forward

The Codecov patch-coverage gate on NVIDIA#4531 fails (68.25% < 80% target) because
two new code paths in this split lack any test:

  * fused_bias_dropout._get_checkpointed_bda — only reachable via the
    mHC layer's recompute path, which the unit-test lane doesn't exercise.
  * transformer_block._build_mhc_recompute_layer_plan,
    _finalize_mhc_recompute_layer, and the input_expand / output_contract
    calls in TransformerBlock.forward — only reachable from a full block
    forward with mHC selective recompute enabled.

Neither original test exists in PR NVIDIA#3430:
  * No test in the upstream PR references get_bias_dropout_add or
    _get_checkpointed_bda.
  * The block-level mHC forward test in NVIDIA#3430 (TestPPForwardWithMHC) is
    multi-GPU-only (torchrun --nproc-per-node=2+) and so wouldn't run
    in the codecov-feeding unit-test lane even if lifted.

Add direct unit coverage:

  tests/unit_tests/fusions/test_bias_dropout_fusion.py
    TestBiasDropoutAddMhcRecompute
      test_checkpointed_bda_forward_backward (parametrized fused/with_bias)
      test_checkpointed_bda_chained_managers
      test_get_bda_without_manager_unchanged

  tests/unit_tests/transformer/test_mhc_block_manager.py
    TestTransformerBlockMHCRecompute
      test_recompute_plan_no_layer_num
      test_recompute_plan_with_layer_num
      test_recompute_plan_disabled
      test_block_forward_input_expand_output_contract

Combined, these directly hit ~80 of the previously-missed lines in
fused_bias_dropout.py (~14 of 19 missed) and transformer_block.py
(~12 of 16 missed), which should bring patch coverage above the 80%
threshold.

Signed-off-by: Yan Xu <yxu1@nvidia.com>
Lint failure in CI run #25142652872 (job 73695715090): black wanted the
attention_mask construction in test_block_forward_input_expand_output_contract
on a single line — at 91 chars it fits the project's line_length=100 limit.

Signed-off-by: Yan Xu <yxu1@nvidia.com>
Resolves all eleven comments on the PR thread:

* Rename CheckpointManager → CheckpointWithoutOutputManager and update
  the docstring; the class strictly manages CheckpointWithoutOutput
  instances, so the new name avoids the broader "checkpoint" overloading.
  Updates all importers and tests. (NVIDIA#1)
* Document why subtracting the per-row max in SinkhornKnopp.forward is
  benign — Sinkhorn's first row-normalization cancels any per-row
  scalar, so the shifted and unshifted exp produce the same fixed point
  and gradient. (NVIDIA#2)
* Use NotImplementedError for the mhc + fine_grained_activation_offloading
  block — it's a known unimplemented interaction, not a config error. (NVIDIA#3)
* Drop the new __call__ override and backward_dw_cudagraph from base
  TransformerLayer; the mHC kwarg extraction now lives on
  HyperConnectionTransformerLayer.__call__, with _mhc_recompute_manager
  initialized in __init__ so forward() reads it directly without a
  getattr fallback. cuda_graphs.py reads is_decode_only() directly,
  so dropping the dynamic_inference_decode_only injection is safe. (NVIDIA#4, NVIDIA#5, NVIDIA#10)
* Rename the FineGrainedActivationOffloadingInterface alias
  off_interface → offload_interface in transformer_layer.py for clarity. (NVIDIA#6)
* Extract a _run_mlp helper on TransformerLayer that owns the MLP-call
  branching (recompute / chunked-prefill / fp8-fp4 / plain-mlp); both
  base and HC _forward_mlp call it, eliminating the previous
  ~80-line duplication. The MoE-cudagraph early-return remains in base
  _forward_mlp after the helper call (HC is guarded against MoE). (NVIDIA#8)
* Raise NotImplementedError at HyperConnectionTransformerLayer.__init__
  when is_moe_layer is True and point users at HyperConnectionHybridLayer;
  drop the dead MoE branch in _get_submodules_under_cudagraphs. (NVIDIA#9)
* No code change for the MoE composition / extensibility comment (NVIDIA#7) —
  see the PR thread reply for the rationale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yan Xu <yxu1@nvidia.com>
The pop was defensive cleanup against an older signature. No caller in
the repo passes is_last_layer_in_recompute_block as a kwarg today —
TransformerBlock sets it as an attribute on the manager object before
each layer call (see transformer_block.py around line 886), and the
forward path reads it via getattr on the manager. Removing the dead pop
per follow-up review feedback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yan Xu <yxu1@nvidia.com>
Resolves the latent duplication between TransformerLayer._forward_attention
and HyperConnectionTransformerLayer._forward_attention by introducing three
override hooks on base TransformerLayer:

* _run_input_layernorm — input layernorm with optional output-discarding
  checkpoint and fine-grained activation offloading. Sets
  self._input_layernorm_checkpoint_active so the caller can gate the
  post-self-attn discard hook on the same condition.
* _apply_self_attn_bda_step — bias-dropout-add for the self-attention
  output plus the post-step offload commit. Subclasses override to swap in
  a fused kernel.
* _run_cross_attention — full cross-attention block (layernorm,
  cross-attention, bda).

Base _forward_attention collapses to a 25-line skeleton that calls these
hooks. HyperConnectionTransformerLayer drops its own _forward_attention
entirely and overrides only _run_input_layernorm (HC pre-wrap +
mHC-aware checkpoint, stashes h_res/h_post on self) and
_apply_self_attn_bda_step (consumes the stashed h_res/h_post in the
fused_h_res_h_post_bda kernel; clears them on exit).

Inheriting _run_cross_attention also fixes a latent gap: HC's previous
cross-attention path silently skipped tuple-output pre_cross_attn_layernorm
unpacking and the fp32_residual_connection cast on the cross-attn
residual. In current usage these knobs are not exercised by any mHC
config (cross-attention is typically IdentityOp), but inheriting base
makes HC correct under those configurations.

Verified on hsg with the four mHC-relevant unit-test files: 87 passed,
the only 4 failures are pre-existing test_sharded_state_dict[tp_pp0/2]
cases that need world_size=8 (environmental, not regressions).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yan Xu <yxu1@nvidia.com>
…attrs

Replaces the instance-attribute pattern introduced in 42921d7
(self._h_res, self._h_post stashed in HC._run_input_layernorm and
consumed in HC._apply_self_attn_bda_step) with explicit parameter
passing through an opaque attn_state slot in the base hook contract.

* TransformerLayer._run_input_layernorm now returns
  (input_layernorm_output, residual, attn_state) where attn_state is
  () for base — an opaque payload subclasses can use to thread extra
  intermediates to _apply_self_attn_bda_step.
* TransformerLayer._apply_self_attn_bda_step gains an attn_state=()
  parameter; base ignores it.
* TransformerLayer._forward_attention skeleton unpacks the 3-tuple and
  forwards attn_state to the bda step.
* HyperConnectionTransformerLayer._run_input_layernorm returns
  (input_layernorm_output, residual, (h_res, h_post)).
* HyperConnectionTransformerLayer._apply_self_attn_bda_step unpacks
  h_res, h_post = attn_state; the manual self._h_res/self._h_post clear
  is gone.
* HyperConnectionTransformerLayer.__init__ drops the
  self._h_res/self._h_post attribute initialization.

Net effect: no implicit data flow between methods (h_res/h_post are
now visibly parameters), no manual lifetime management, two fewer
instance attributes on every HC instance.

Verified on hsg with the mHC test suite: 87 passed / 4 failed
(identical baseline — the 4 failures are pre-existing
test_sharded_state_dict[tp_pp0/2] cases that need world_size=8).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yan Xu <yxu1@nvidia.com>
Two comment lines above the _set_proj_residual call were dropped during
the _forward_attention skeleton refactor in 42921d7. They explained
that the call sets up the residual specifically for a fused
reduce-scatter + add + layer-norm + all-gather kernel in attention's
out_proj — non-obvious from the function name alone, so worth
preserving.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yan Xu <yxu1@nvidia.com>
Mirrors the _forward_attention refactor for the MLP path, per
@mathemakitten's review feedback on PR NVIDIA#4531.

Base TransformerLayer changes:
* Add _run_pre_mlp_layernorm: wraps _forward_pre_mlp_layernorm to
  handle tuple-output unpacking and the fp32_residual_connection cast,
  and returns (pre_mlp_layernorm_output, residual, mlp_state) where
  mlp_state is an opaque payload subclasses can use to thread extra
  intermediates to _apply_mlp_bda_step. Base returns ().
* Rename _forward_post_mlp -> _apply_mlp_bda_step and add an mlp_state=()
  parameter (ignored by base). Updates the three internal callers (the
  base _forward_mlp skeleton, the cudagraph fallback path, and
  MoETransformerLayer._forward_mlp_postprocess).
* Base _forward_mlp now reads as: _run_pre_mlp_layernorm -> _run_mlp ->
  MoE-cudagraph branch (unchanged) -> _apply_mlp_bda_step. The MoE
  early-return stays in base since HC guards against MoE submodules.

HyperConnectionTransformerLayer changes:
* Drop _forward_mlp (inherited from base).
* Drop _forward_post_mlp_with_fused_hyper_connection.
* Override _run_pre_mlp_layernorm: HC pre-wrap (mlp_hyper_connection),
  mHC-aware checkpoint condition, returns the n-stream residual and
  threads (mlp_h_res, mlp_hc_h_post) via mlp_state.
* Override _apply_mlp_bda_step: unpacks the mlp_state slot, computes the
  mhc_mlp_bda_manager from self._mhc_recompute_manager (None on the
  last layer of a recompute block, since the block-end finalize hook
  handles its discard), runs fused_h_res_h_post_bda + viewless wrap.
* Update HC.forward to drop the now-unneeded mhc_recompute_manager
  kwarg on the _forward_mlp call (the manager flows via self).
* Extend the comment near self._mhc_recompute_manager to enumerate the
  four reader methods (_run_input_layernorm, _apply_self_attn_bda_step,
  _run_pre_mlp_layernorm, _apply_mlp_bda_step) so future maintainers
  don't try to "clean up" what looks like an unused attribute.

Net effect: HC drops ~110 lines of duplicated MLP plumbing in exchange
for two ~40-line override methods, and the contract for thread-state
between hooks now matches attention exactly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yan Xu <yxu1@nvidia.com>
The CheckpointManager → CheckpointWithoutOutputManager rename pushed the
docstring for HyperConnectionModule.forward.mhc_recompute_manager from
just under to one char over the 100-char pylint limit. Shorten
"checkpoint management." → "checkpoint mgmt." to bring it back under.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yan Xu <yxu1@nvidia.com>
Plain TransformerLayer (and its MoETransformerLayer subclass) doesn't
accept the mhc_recompute_manager kwarg — only HyperConnectionTransformerLayer's
__call__ pops it. Unconditionally threading it through layer(...) caused
TypeError on every non-mHC GPTModel test.

Gate the kwarg on `mhc_manager is not None` so it's only passed when the
block actually constructed a manager (i.e. mHC is enabled).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yan Xu <yxu1@nvidia.com>
The MLP-hook extraction in this PR renamed
TransformerLayer._forward_post_mlp -> _apply_mlp_bda_step and added an
`mlp_state` parameter. The rename silently breaks downstream subclasses
that still override the legacy name (e.g. Gemma4TransformerLayer in
Megatron-Bridge): the orphaned override is never called, the inherited
base implementation runs, and the subclass's post-MLP math (post-FFN
layernorm, layer_scalar, etc.) is dropped without any error.

Add a one-shot shim at the top of `_apply_mlp_bda_step` that walks the
MRO from the concrete class up to TransformerLayer; if any subclass
defined the legacy `_forward_post_mlp` in its own __dict__, route
through it and emit a DeprecationWarning. The `mlp_state` slot is
dropped when delegating, matching the legacy 2-arg contract.

To be removed in a future release once known downstream subclasses have
migrated to `_apply_mlp_bda_step`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yan Xu <yxu1@nvidia.com>
@Connor-XY

Copy link
Copy Markdown
Contributor Author

/ok to test 27fef06

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

complexity: high deepseekv4 DeepSeek V4 PRs Final Review PR is in the "final review" stage Run functional tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants