
[dev] mHC kernel fusion #3828

Merged
jingqiny-99 merged 17 commits into NVIDIA:dev from jingqiny-99:jingqiny/feature-mHC-fusion on Mar 25, 2026


@jingqiny-99 jingqiny-99 commented Mar 12, 2026


What does this PR do?

Kernel fusion (using cuTile) for mHC.

Verified with the 27B model (introduced in the original mHC paper) on GB200.

Throughput: [image]

Validation LM Loss: [image]

Iteration Time: [image]

Kernel Performance

Config: s=8192, b=4, n=4, C=7168

| Kernel | Framework | Fwd (ms) | Bwd (ms) | E2E (ms) | Speedup |
|---|---|---|---|---|---|
| sinkhorn | compiled | 0.346 | 1.018 | 1.364 | 1.00x |
| sinkhorn | cutile | 0.065 | 0.133 | 0.198 | 6.89x |
| h_aggregate | compiled | 0.338 | 0.808 | 1.146 | 1.00x |
| h_aggregate | cutile | 0.349 | 0.661 | 1.010 | 1.13x |
| h_post_bda | compiled | 3.829 | 6.663 | 10.492 | 1.00x |
| h_post_bda | cutile | 0.684 | 2.549 | 3.234 | 3.24x |
| proj_rms | compiled | 0.554 | 2.204 | 2.758 | 1.00x |
| proj_rms | cutile | 0.426 | 1.541 | 1.967 | 1.40x |
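The Speedup column is consistent with dividing the compiled E2E time by the cuTile E2E time for the same kernel. A minimal sketch of that arithmetic, using only the numbers reported above (the dictionary layout and the `speedup` helper are illustrative and not part of the PR):

```python
# E2E times in ms copied from the benchmark rows above: (compiled, cutile).
E2E_MS = {
    "sinkhorn": (1.364, 0.198),
    "h_aggregate": (1.146, 1.010),
    "h_post_bda": (10.492, 3.234),
    "proj_rms": (2.758, 1.967),
}


def speedup(compiled_ms: float, cutile_ms: float) -> float:
    """Speedup of the cuTile kernel relative to the compiled baseline."""
    return compiled_ms / cutile_ms


# Round to two decimals, matching the precision used in the table.
speedups = {name: round(speedup(*times), 2) for name, times in E2E_MS.items()}

for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
```

Under this reading, sinkhorn and h_post_bda account for most of the gain, while h_aggregate is close to parity with the compiled version.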

Prerequisites

Linux x86_64, Linux aarch64 or Windows x86_64

A GPU with compute capability 8.x, 10.x, 11.x, or 12.x

NVIDIA Driver r580 or later

Python version 3.10, 3.11, 3.12 or 3.13

Usage

Install cuTile: pip install cuda-tile[tileiras]

Enable the fused kernels with --use-fused-mhc
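Before passing --use-fused-mhc, it can help to confirm that the environment meets the Python prerequisite and that cuTile is importable. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the module name `cuda.tile` is taken from a later commit message on this page that describes the kernel module as "gated by cuda.tile import availability". It does not check the GPU or driver requirements.

```python
import importlib.util
import sys


def python_supported(version=sys.version_info) -> bool:
    """True for Python 3.10 through 3.13, the range listed under Prerequisites."""
    return (3, 10) <= tuple(version[:2]) <= (3, 13)


def cutile_importable() -> bool:
    """True if the `cuda.tile` module can be located (hypothetical helper).

    find_spec raises ModuleNotFoundError when the parent package `cuda`
    is missing, so that case is treated as "not available".
    """
    try:
        return importlib.util.find_spec("cuda.tile") is not None
    except (ImportError, ValueError):
        return False


if __name__ == "__main__":
    print("Python version supported:", python_supported())
    print("cuTile importable:", cutile_importable())
```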

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or mention @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into the `dev` branch: the proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

@copy-pr-bot

copy-pr-bot Bot commented Mar 12, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jingqiny-99

Author

/ok to test b4e6d8b

@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Mar 12, 2026
@jingqiny-99

Author

/ok to test 28bf50e

@jingqiny-99

Author

/claude review

Comment thread megatron/core/fusions/fused_mhc_kernels.py
Comment thread megatron/core/fusions/fused_mhc_kernels.py Outdated
@jingqiny-99

Author

/ok to test 910f985

@jingqiny-99

Author

/claude review

Comment thread megatron/core/fusions/fused_mhc_kernels.py Outdated
Comment thread megatron/core/transformer/hyper_connection.py Outdated

@claude claude Bot left a comment

Contributor


Two issues found:

  1. Sinkhorn fallback path missing numerical stability (fused_mhc_kernels.py:64): The _ref_sinkhorn_fwd torch.compile fallback uses bare torch.exp(input_logits) without the row-max subtraction that both the cuTile kernel and the original SinkhornKnopp class use. This can overflow for large logit values, especially in bf16.

  2. Sinkhorn eps silently changed from 1e-6 to 1e-8 (hyper_connection.py:152): The original SinkhornKnopp.eps was 1e-6, but fused_sinkhorn defaults to 1e-8 and the call site doesn't pass eps. This is a behavioral change that may explain some of the golden value drift.
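The row-max subtraction mentioned in the first issue can be illustrated without any GPU code. The following is a pure-Python sketch of a stabilised Sinkhorn-Knopp iteration, not the PR's `_ref_sinkhorn_fwd` or the cuTile kernel: the function name, the list-of-lists input, and the iteration count are assumptions for illustration, and `eps` is exposed as a parameter so that the 1e-6 versus 1e-8 choice from the second issue is explicit at the call site.

```python
import math


def sinkhorn_knopp(logits, num_iters=20, eps=1e-6):
    """Illustrative Sinkhorn-Knopp on a square list-of-lists of logits.

    Each row's maximum is subtracted before exponentiating, so exp() only
    ever sees arguments <= 0 and cannot overflow. The constant per-row
    factor this removes is absorbed by the first row normalisation.
    """
    mat = []
    for row in logits:
        row_max = max(row)
        mat.append([math.exp(v - row_max) for v in row])

    n = len(mat)
    for _ in range(num_iters):
        # Normalise every row to sum to (approximately) one.
        for i in range(n):
            row_sum = sum(mat[i]) + eps
            mat[i] = [v / row_sum for v in mat[i]]
        # Normalise every column to sum to (approximately) one.
        for j in range(n):
            col_sum = sum(mat[i][j] for i in range(n)) + eps
            for i in range(n):
                mat[i][j] /= col_sum
    return mat


# Logits this large make a bare exp() overflow, but the stabilised
# version still produces an (approximately) doubly stochastic matrix.
result = sinkhorn_knopp([[1000.0, 999.0], [998.0, 1000.0]])
```

In Python, a bare `math.exp(1000.0)` raises `OverflowError`; in fp32 or bf16 tensor code the same overflow would typically surface as `inf` and then propagate as `nan` through the normalisation, which is the failure mode the review comment describes.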

Also worth double-checking: the num-zeros golden values show large jumps at several steps (e.g., step 8: 1580→34401, step 14: 1777→34591) — roughly 20x increases. This pattern is intermittent (some steps are normal, some are ~34k). Worth confirming this is expected and not an artifact of the numerical stability issue above.

@jingqiny-99

Author

/ok to test 0416b10

@jingqiny-99

Author

/claude review

Comment thread megatron/core/fusions/fused_mhc_kernels.py
@jingqiny-99

Author

/ok to test 578f5dd

@jingqiny-99

Author

/ok to test 296a892

@jingqiny-99 jingqiny-99 marked this pull request as ready for review March 13, 2026 06:20
@jingqiny-99 jingqiny-99 added this pull request to the merge queue Mar 25, 2026
@svcnvidia-nemo-ci


🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23527281274

Merged via the queue into NVIDIA:dev with commit 4108d68 Mar 25, 2026
111 checks passed
@jingqiny-99 jingqiny-99 deleted the jingqiny/feature-mHC-fusion branch March 25, 2026 06:41
@jingqiny-99 jingqiny-99 restored the jingqiny/feature-mHC-fusion branch March 25, 2026 06:55
jingqiny-99 added a commit to jingqiny-99/Megatron-LM that referenced this pull request Mar 31, 2026
@sbhavani sbhavani mentioned this pull request Apr 24, 2026
3 tasks
Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request Apr 29, 2026
… manifold hyper connection

Adds the core Manifold Hyper Connection (mHC) module and the supporting
transformer-block / transformer-layer / config / recompute changes. This
is the first split of NVIDIA#3430 (mirror of upstream NVIDIA#2943 + kernel-fusion NVIDIA#3828),
covering only files owned by core-adlr / core-nemo / transformer / cuda-graphs
reviewers. GPT model wiring, pipeline-parallel support, fused mHC cuTile
kernels, and the functional-test recipe are deferred to follow-up split PRs.

Original work by @jingqiny-99 in NVIDIA#3430
(upstream NVIDIA#2943, basic pytorch impl only — kernel fusion NVIDIA#3828 deferred).

Co-authored-by: jingqiny-99 <jingqiny@nvidia.com>
Co-authored-by: Dennis Liu <denliu@nvidia.com>
Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request May 4, 2026
… manifold hyper connection

Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request May 6, 2026
… manifold hyper connection

Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request May 7, 2026
… manifold hyper connection

Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request May 11, 2026
… manifold hyper connection

Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request May 14, 2026
… manifold hyper connection

Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request May 18, 2026
… manifold hyper connection

Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request Jun 12, 2026
… manifold hyper connection

Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request Jun 12, 2026
Fourth split of NVIDIA#3430 (mirror of upstream NVIDIA#2943 + kernel-fusion NVIDIA#3828).
Adds the cuTile-based fused implementation of the four mHC primitives
(Sinkhorn-Knopp, H-aggregate, H_post * BDA, projection + RMS) plus the
native PyTorch reference versions used as a fallback and for unit-test
comparison.

Files:
  - megatron/core/fusions/fused_mhc_kernels.py: new — cuTile kernel
    module; gated by cuda.tile import availability.
  - megatron/core/transformer/hyper_connection.py: adds the `native_*`
    reference functions and the call-site dispatch to fused vs native
    based on `config.use_fused_mhc` and `is_cutile_available()`.
  - megatron/core/transformer/transformer_config.py: adds
    `use_fused_mhc` and related kernel-config fields.
  - tests/unit_tests/fusions/test_fused_mhc_kernels.py: new — forward
    + backward parity tests for each fused kernel against its native
    reference (skipped when cuTile is not installed).

Depends on NVIDIA#4531 (Split 1) for the underlying mHC module and on
and transformer_config.py files in this PR pick up the merged-and-reviewed
state of those files from PR NVIDIA#4469 (dsv4 branch); the diff against the
current `mhc-pr1-core` therefore includes the kernel additions plus a
few review-pass refinements that landed on dsv4 after PR NVIDIA#4531 was cut.

Reviewer groups touched: core-adlr, core-nemo, transformer.

Final remaining split:
  - Split 5: functional-test recipe (`gpt3_mcore_te_tp2_pp2_mhc`)

Original work by @jingqiny-99 in NVIDIA#3430
(upstream kernel-fusion PR NVIDIA#3828).

Co-authored-by: jingqiny-99 <jingqiny@nvidia.com>
Co-authored-by: Dennis Liu <denliu@nvidia.com>
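
The commit message above describes a call-site dispatch between fused and native implementations based on `config.use_fused_mhc` and `is_cutile_available()`. A minimal sketch of that selection logic follows; `MhcConfig`, `select_impl`, and the placeholder callables are hypothetical stand-ins, and the actual code in hyper_connection.py may be structured differently.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MhcConfig:
    """Hypothetical stand-in for the relevant transformer config field."""

    use_fused_mhc: bool = False


def select_impl(
    config: MhcConfig,
    cutile_available: bool,
    fused: Callable,
    native: Callable,
) -> Callable:
    """Return the fused kernel only when it is both requested and available."""
    if config.use_fused_mhc and cutile_available:
        return fused
    # Fall back to the native PyTorch reference in every other case.
    return native


# Placeholder callables standing in for a fused kernel and its reference.
def fused_op(x):
    return ("fused", x)


def native_op(x):
    return ("native", x)


chosen = select_impl(MhcConfig(use_fused_mhc=True), False, fused_op, native_op)
print(chosen(1))
```

Falling back silently when cuTile is missing keeps the same configuration usable on machines without it, at the cost of the flag not guaranteeing that the fused path is actually taken.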
Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request Jun 12, 2026
Final split of NVIDIA#3430 (mirror of upstream NVIDIA#2943 + kernel-fusion NVIDIA#3828).
Adds the end-to-end functional test for the mHC feature:

Files:
  - tests/functional_tests/test_cases/gpt/gpt3_mcore_te_tp2_pp2_mhc/
      model_config.yaml — TP=2, PP=2 GPT-3 mHC training recipe
      golden_values_dev_dgx_h100.json — golden metrics on dgx_h100
  - tests/test_utils/recipes/h100/gpt.yaml — registers the new test
    case under the mr / mr-github scopes for dgx_h100.

Golden values picked up from NVIDIA#4469 (dsv4 branch) where the recipe
has been calibrated against the final fused-kernel-on configuration.
Reviewers can recalibrate if needed once Splits 1-4 land.

Depends on NVIDIA#4531 (Split 1), NVIDIA#4945 (Split 2), and NVIDIA#4947 (Split 4) —
the recipe enables enable_hyper_connections via the GPT spec and
exercises the fused kernel path, so all three feature splits must
land first.

Reviewer groups touched: ci.

Original work by @jingqiny-99 in NVIDIA#3430
(upstream NVIDIA#2943).

Co-authored-by: jingqiny-99 <jingqiny@nvidia.com>
Co-authored-by: Dennis Liu <denliu@nvidia.com>
Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request Jun 12, 2026
… manifold hyper connection

Signed-off-by: Yan Xu <yxu1@nvidia.com>
Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request Jun 12, 2026
Fourth split of NVIDIA#3430 (mirror of upstream NVIDIA#2943 + kernel-fusion NVIDIA#3828).
Signed-off-by: Yan Xu <yxu1@nvidia.com>
Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request Jun 12, 2026
Final split of NVIDIA#3430 (mirror of upstream NVIDIA#2943 + kernel-fusion NVIDIA#3828).
Signed-off-by: Yan Xu <yxu1@nvidia.com>