[dev] mHC kernel fusion by jingqiny-99 · Pull Request #3828 · NVIDIA/Megatron-LM

jingqiny-99 · 2026-03-12T11:49:40Z

What does this PR do?

Adds fused kernels for mHC, implemented with cuTile.

Verified on GB200 using the 27B model introduced in the original mHC paper.

[Plots not included: Throughput, Validation LM Loss, Iteration Time]

Kernel Performance

Config: `s=8192, b=4, n=4, C=7168`

| Kernel | Framework | Fwd (ms) | Bwd (ms) | E2E (ms) | Speedup |
| --- | --- | --- | --- | --- | --- |
| sinkhorn | compiled | 0.346 | 1.018 | 1.364 | 1.00x |
| sinkhorn | cutile | 0.065 | 0.133 | 0.198 | 6.89x |
| h_aggregate | compiled | 0.338 | 0.808 | 1.146 | 1.00x |
| h_aggregate | cutile | 0.349 | 0.661 | 1.010 | 1.13x |
| h_post_bda | compiled | 3.829 | 6.663 | 10.492 | 1.00x |
| h_post_bda | cutile | 0.684 | 2.549 | 3.234 | 3.24x |
| proj_rms | compiled | 0.554 | 2.204 | 2.758 | 1.00x |
| proj_rms | cutile | 0.426 | 1.541 | 1.967 | 1.40x |
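As a reading aid, the E2E and Speedup columns can be reproduced from the Fwd and Bwd columns. The sketch below copies the timings from the table; the helper names (`e2e_ms`, `speedup`, `combined_speedup`) are made up for illustration and are not part of the PR. Because the per-pass times are already rounded, a recomputed value may differ from the table in the last digit.

```python
# (kernel, framework) -> (fwd_ms, bwd_ms), copied from the table above.
TIMINGS = {
    ("sinkhorn", "compiled"): (0.346, 1.018),
    ("sinkhorn", "cutile"): (0.065, 0.133),
    ("h_aggregate", "compiled"): (0.338, 0.808),
    ("h_aggregate", "cutile"): (0.349, 0.661),
    ("h_post_bda", "compiled"): (3.829, 6.663),
    ("h_post_bda", "cutile"): (0.684, 2.549),
    ("proj_rms", "compiled"): (0.554, 2.204),
    ("proj_rms", "cutile"): (0.426, 1.541),
}
KERNELS = ("sinkhorn", "h_aggregate", "h_post_bda", "proj_rms")


def e2e_ms(kernel: str, framework: str) -> float:
    """End-to-end time is forward plus backward time."""
    fwd, bwd = TIMINGS[(kernel, framework)]
    return fwd + bwd


def speedup(kernel: str) -> float:
    """Ratio of compiled E2E time to cuTile E2E time for one kernel."""
    return e2e_ms(kernel, "compiled") / e2e_ms(kernel, "cutile")


def combined_speedup() -> float:
    """Speedup if each of the four kernels runs once per step (an assumption)."""
    compiled = sum(e2e_ms(k, "compiled") for k in KERNELS)
    cutile = sum(e2e_ms(k, "cutile") for k in KERNELS)
    return compiled / cutile


if __name__ == "__main__":
    for k in KERNELS:
        print(f"{k}: {speedup(k):.2f}x")
    print(f"combined: {combined_speedup():.2f}x")
```

The combined figure only weights the kernels by one call each; the real per-layer call counts are not stated in this PR, so treat it as a rough indication.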

Prerequisites

Linux x86_64, Linux aarch64 or Windows x86_64

A GPU with compute capability 8.x, 10.x, 11.x or 12.x

NVIDIA Driver r580 or later

Python version 3.10, 3.11, 3.12 or 3.13
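A minimal sketch of how these prerequisites could be checked before enabling the fused path. It encodes only the list as written above; `check_prerequisites` and its parameters are hypothetical names, and the caller is assumed to supply the GPU compute-capability major version and the driver branch number (for example, the first element of `torch.cuda.get_device_capability()` is one way to obtain the former). The platform requirement is not checked here.

```python
import sys
from typing import List, Optional, Tuple

# Values taken from the prerequisites list above.
SUPPORTED_CC_MAJORS = {8, 10, 11, 12}
MIN_DRIVER_BRANCH = 580
SUPPORTED_PYTHON = {(3, 10), (3, 11), (3, 12), (3, 13)}


def check_prerequisites(
    cc_major: int,
    driver_branch: int,
    python_version: Optional[Tuple[int, int]] = None,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks pass."""
    if python_version is None:
        # Default to the interpreter running this script.
        python_version = tuple(sys.version_info[:2])

    problems = []
    if cc_major not in SUPPORTED_CC_MAJORS:
        problems.append(f"compute capability {cc_major}.x is not in the listed set")
    if driver_branch < MIN_DRIVER_BRANCH:
        problems.append(f"driver r{driver_branch} is older than r{MIN_DRIVER_BRANCH}")
    if python_version not in SUPPORTED_PYTHON:
        problems.append(f"Python {python_version[0]}.{python_version[1]} is not listed")
    return problems


if __name__ == "__main__":
    # Example: a compute-capability 10.x GPU with an r580 driver on Python 3.12.
    print(check_prerequisites(10, 580, (3, 12)))
```

Returning a list of problems rather than raising lets a caller log every unmet requirement at once and then fall back to the non-fused path.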

Usage

Install cuTile: `pip install cuda-tile[tileiras]`

Enable the fused kernels with `--use-fused-mhc`

Contribution process

Pre-checks

I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code (Typing guidelines)
I have added relevant documentation
I have run the autoformatter.sh on my PR

jingqiny-99 · 2026-03-12T11:50:14Z

/ok to test b4e6d8b

jingqiny-99 · 2026-03-13T01:25:31Z

/ok to test 28bf50e

jingqiny-99 · 2026-03-13T01:37:53Z

/claude review

jingqiny-99 · 2026-03-13T02:26:40Z

/ok to test 910f985

jingqiny-99 · 2026-03-13T02:32:15Z

/claude review

claude

Two issues found:

1. Sinkhorn fallback path missing numerical stability (`fused_mhc_kernels.py:64`): The `_ref_sinkhorn_fwd` `torch.compile` fallback uses bare `torch.exp(input_logits)` without the row-max subtraction that both the cuTile kernel and the original `SinkhornKnopp` class use. This can overflow for large logit values, especially in bf16.
2. Sinkhorn `eps` silently changed from 1e-6 to 1e-8 (`hyper_connection.py:152`): The original `SinkhornKnopp.eps` was 1e-6, but `fused_sinkhorn` defaults to 1e-8 and the call site doesn't pass `eps`. This is a behavioral change that may explain some of the golden value drift.

Also worth double-checking: the num-zeros golden values show large jumps at several steps (e.g., step 8: 1580→34401, step 14: 1777→34591) — roughly 20x increases. This pattern is intermittent (some steps are normal, some are ~34k). Worth confirming this is expected and not an artifact of the numerical stability issue above.
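To illustrate the first point, here is a small pure-Python sketch of Sinkhorn normalization with row-max subtraction. It is not the code from `fused_mhc_kernels.py` or `SinkhornKnopp`; the function name, the iteration count, and the place where `eps` is added are assumptions made for the example. Subtracting a row's maximum only rescales that row of `exp(logits)`, and the first row normalization absorbs that scale, so the result is unchanged up to the effect of `eps`.

```python
import math
from typing import List


def sinkhorn(logits: List[List[float]], num_iters: int = 20, eps: float = 1e-6) -> List[List[float]]:
    """Alternate row and column normalization of exp(logits), computed stably."""
    # Subtract each row's max so exp() only ever sees arguments <= 0.
    mat = []
    for row in logits:
        row_max = max(row)
        mat.append([math.exp(v - row_max) for v in row])

    n_cols = len(mat[0])
    for _ in range(num_iters):
        # Row normalization: each row sums to (almost) 1.
        mat = [[v / (sum(row) + eps) for v in row] for row in mat]
        # Column normalization: each column sums to (almost) 1.
        col_sums = [sum(row[j] for row in mat) for j in range(n_cols)]
        mat = [[v / (col_sums[j] + eps) for j, v in enumerate(row)] for row in mat]
    return mat


if __name__ == "__main__":
    big = [[1000.0, 999.0], [998.0, 1000.0]]
    # A bare math.exp(1000.0) raises OverflowError; the stable version works.
    for row in sinkhorn(big):
        print([round(v, 4) for v in row])
```

In Python floats the unstabilized `exp` raises an exception; in bf16 or fp32 tensors it would instead produce `inf`, which then turns into `nan` during normalization. The `eps` argument is exposed so that a call site can pass the intended value explicitly instead of relying on a default.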

jingqiny-99 · 2026-03-13T03:18:04Z

/ok to test 0416b10

jingqiny-99 · 2026-03-13T04:48:17Z

/claude review

jingqiny-99 · 2026-03-13T04:54:47Z

/ok to test 578f5dd

jingqiny-99 · 2026-03-13T05:00:37Z

/ok to test 296a892

svcnvidia-nemo-ci · 2026-03-25T06:00:50Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23527281274

@jingqiny-99

… manifold hyper connection Adds the core Manifold Hyper Connection (mHC) module and the supporting transformer-block / transformer-layer / config / recompute changes. This is the first split of NVIDIA#3430 (mirror of upstream NVIDIA#2943 + kernel-fusion NVIDIA#3828), covering only files owned by core-adlr / core-nemo / transformer / cuda-graphs reviewers. GPT model wiring, pipeline-parallel support, fused mHC cuTile kernels, and the functional-test recipe are deferred to follow-up split PRs. Original work by @jingqiny-99 in NVIDIA#3430 (upstream NVIDIA#2943, basic pytorch impl only — kernel fusion NVIDIA#3828 deferred). Co-authored-by: jingqiny-99 <jingqiny@nvidia.com> Co-authored-by: Dennis Liu <denliu@nvidia.com>

@jingqiny-99

Fourth split of NVIDIA#3430 (mirror of upstream NVIDIA#2943 + kernel-fusion NVIDIA#3828). Adds the cuTile-based fused implementation of the four mHC primitives (Sinkhorn-Knopp, H-aggregate, H_post * BDA, projection + RMS) plus the native PyTorch reference versions used as a fallback and for unit-test comparison.

Files:
- megatron/core/fusions/fused_mhc_kernels.py: new; cuTile kernel module, gated by cuda.tile import availability.
- megatron/core/transformer/hyper_connection.py: adds the `native_*` reference functions and the call-site dispatch to fused vs native based on `config.use_fused_mhc` and `is_cutile_available()`.
- megatron/core/transformer/transformer_config.py: adds `use_fused_mhc` and related kernel-config fields.
- tests/unit_tests/fusions/test_fused_mhc_kernels.py: new; forward + backward parity tests for each fused kernel against its native reference (skipped when cuTile is not installed).

Depends on NVIDIA#4531 (Split 1) for the underlying mHC module and on and transformer_config.py files in this PR pick up the merged-and-reviewed state of those files from PR NVIDIA#4469 (dsv4 branch); the diff against the current `mhc-pr1-core` therefore includes the kernel additions plus a few review-pass refinements that landed on dsv4 after PR NVIDIA#4531 was cut.

Reviewer groups touched: core-adlr, core-nemo, transformer.

Final remaining split:
- Split 5: functional-test recipe (`gpt3_mcore_te_tp2_pp2_mhc`)

Original work by @jingqiny-99 in NVIDIA#3430 (upstream kernel-fusion PR NVIDIA#3828).

Co-authored-by: jingqiny-99 <jingqiny@nvidia.com>
Co-authored-by: Dennis Liu <denliu@nvidia.com>
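The commit message above describes a dispatch between fused and native implementations based on `config.use_fused_mhc` and `is_cutile_available()`. A rough sketch of that pattern follows. Only those two names come from the commit message; the body of `is_cutile_available`, the `Config` stand-in, and `select_impl` are assumptions for illustration and may differ from the actual Megatron-LM code.

```python
import importlib.util
from dataclasses import dataclass
from typing import Callable


def is_cutile_available() -> bool:
    """Report whether the `cuda.tile` module can be located.

    find_spec imports the parent package `cuda` if present, but does not
    import `cuda.tile` itself.
    """
    try:
        return importlib.util.find_spec("cuda.tile") is not None
    except (ImportError, ValueError):
        # The parent package is missing or has no usable spec.
        return False


@dataclass
class Config:
    """Hypothetical stand-in for the transformer config."""

    use_fused_mhc: bool = False


def select_impl(config: Config, fused: Callable, native: Callable) -> Callable:
    """Pick the fused kernel only if it was requested and cuTile is importable."""
    if config.use_fused_mhc and is_cutile_available():
        return fused
    return native


if __name__ == "__main__":
    impl = select_impl(Config(use_fused_mhc=True), lambda: "fused", lambda: "native")
    print(impl())  # depends on whether cuTile is installed
```

Resolving the choice once, rather than checking on every call, keeps the hot path free of import-machinery lookups.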

@jingqiny-99

Final split of NVIDIA#3430 (mirror of upstream NVIDIA#2943 + kernel-fusion NVIDIA#3828). Adds the end-to-end functional test for the mHC feature.

Files:
- tests/functional_tests/test_cases/gpt/gpt3_mcore_te_tp2_pp2_mhc/
  - model_config.yaml: TP=2, PP=2 GPT-3 mHC training recipe
  - golden_values_dev_dgx_h100.json: golden metrics on dgx_h100
- tests/test_utils/recipes/h100/gpt.yaml: registers the new test case under the mr / mr-github scopes for dgx_h100.

Golden values picked up from NVIDIA#4469 (dsv4 branch) where the recipe has been calibrated against the final fused-kernel-on configuration. Reviewers can recalibrate if needed once Splits 1-4 land.

Depends on NVIDIA#4531 (Split 1), NVIDIA#4945 (Split 2), and NVIDIA#4947 (Split 4): the recipe enables enable_hyper_connections via the GPT spec and exercises the fused kernel path, so all three feature splits must land first.

Reviewer groups touched: ci.

Original work by @jingqiny-99 in NVIDIA#3430 (upstream NVIDIA#2943).

Co-authored-by: jingqiny-99 <jingqiny@nvidia.com>
Co-authored-by: Dennis Liu <denliu@nvidia.com>

@jingqiny-99

… manifold hyper connection Adds the core Manifold Hyper Connection (mHC) module and the supporting transformer-block / transformer-layer / config / recompute changes. This is the first split of NVIDIA#3430 (mirror of upstream NVIDIA#2943 + kernel-fusion NVIDIA#3828), covering only files owned by core-adlr / core-nemo / transformer / cuda-graphs reviewers. GPT model wiring, pipeline-parallel support, fused mHC cuTile kernels, and the functional-test recipe are deferred to follow-up split PRs. Original work by @jingqiny-99 in NVIDIA#3430 (upstream NVIDIA#2943, basic pytorch impl only — kernel fusion NVIDIA#3828 deferred). Co-authored-by: jingqiny-99 <jingqiny@nvidia.com> Co-authored-by: Dennis Liu <denliu@nvidia.com> Signed-off-by: Yan Xu <yxu1@nvidia.com>

@jingqiny-99

Fourth split of NVIDIA#3430 (mirror of upstream NVIDIA#2943 + kernel-fusion NVIDIA#3828). Adds the cuTile-based fused implementation of the four mHC primitives (Sinkhorn-Knopp, H-aggregate, H_post * BDA, projection + RMS) plus the native PyTorch reference versions used as a fallback and for unit-test comparison.

Files:
- megatron/core/fusions/fused_mhc_kernels.py: new; cuTile kernel module, gated by cuda.tile import availability.
- megatron/core/transformer/hyper_connection.py: adds the `native_*` reference functions and the call-site dispatch to fused vs native based on `config.use_fused_mhc` and `is_cutile_available()`.
- megatron/core/transformer/transformer_config.py: adds `use_fused_mhc` and related kernel-config fields.
- tests/unit_tests/fusions/test_fused_mhc_kernels.py: new; forward + backward parity tests for each fused kernel against its native reference (skipped when cuTile is not installed).

Depends on NVIDIA#4531 (Split 1) for the underlying mHC module and on and transformer_config.py files in this PR pick up the merged-and-reviewed state of those files from PR NVIDIA#4469 (dsv4 branch); the diff against the current `mhc-pr1-core` therefore includes the kernel additions plus a few review-pass refinements that landed on dsv4 after PR NVIDIA#4531 was cut.

Reviewer groups touched: core-adlr, core-nemo, transformer.

Final remaining split:
- Split 5: functional-test recipe (`gpt3_mcore_te_tp2_pp2_mhc`)

Original work by @jingqiny-99 in NVIDIA#3430 (upstream kernel-fusion PR NVIDIA#3828).

Co-authored-by: jingqiny-99 <jingqiny@nvidia.com>
Co-authored-by: Dennis Liu <denliu@nvidia.com>
Signed-off-by: Yan Xu <yxu1@nvidia.com>

@jingqiny-99

Final split of NVIDIA#3430 (mirror of upstream NVIDIA#2943 + kernel-fusion NVIDIA#3828). Adds the end-to-end functional test for the mHC feature.

Files:
- tests/functional_tests/test_cases/gpt/gpt3_mcore_te_tp2_pp2_mhc/
  - model_config.yaml: TP=2, PP=2 GPT-3 mHC training recipe
  - golden_values_dev_dgx_h100.json: golden metrics on dgx_h100
- tests/test_utils/recipes/h100/gpt.yaml: registers the new test case under the mr / mr-github scopes for dgx_h100.

Golden values picked up from NVIDIA#4469 (dsv4 branch) where the recipe has been calibrated against the final fused-kernel-on configuration. Reviewers can recalibrate if needed once Splits 1-4 land.

Depends on NVIDIA#4531 (Split 1), NVIDIA#4945 (Split 2), and NVIDIA#4947 (Split 4): the recipe enables enable_hyper_connections via the GPT spec and exercises the fused kernel path, so all three feature splits must land first.

Reviewer groups touched: ci.

Original work by @jingqiny-99 in NVIDIA#3430 (upstream NVIDIA#2943).

Co-authored-by: jingqiny-99 <jingqiny@nvidia.com>
Co-authored-by: Dennis Liu <denliu@nvidia.com>
Signed-off-by: Yan Xu <yxu1@nvidia.com>

jingqiny-99 added 3 commits March 11, 2026 10:03

init: mHC cutile kernels

ad0504b

upd: refine implementations

7d60cc7

upd: lint

ba30667

Merge branch 'dev' into jingqiny/feature-mHC-fusion

b4e6d8b

svcnvidia-nemo-ci added this to the Core 0.16 milestone Mar 12, 2026

upd: fix dependency for Hopper GPU, update FT golden value

28bf50e