[dev] mHC kernel fusion #3828
Conversation
/ok to test b4e6d8b
/ok to test 28bf50e
/claude review
/ok to test 910f985
/claude review
Two issues found:
- Sinkhorn fallback path missing numerical stability (`fused_mhc_kernels.py:64`): The `_ref_sinkhorn_fwd` torch.compile fallback uses bare `torch.exp(input_logits)` without the row-max subtraction that both the cuTile kernel and the original `SinkhornKnopp` class use. This can overflow for large logit values, especially in bf16.
- Sinkhorn eps silently changed from 1e-6 to 1e-8 (`hyper_connection.py:152`): The original `SinkhornKnopp.eps` was `1e-6`, but `fused_sinkhorn` defaults to `1e-8` and the call site doesn't pass eps. This is a behavioral change that may explain some of the golden value drift.
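To make the two points above concrete, here is a minimal framework-free sketch of a Sinkhorn-Knopp normalization with row-max subtraction and an explicit `eps`. It is not the code from `fused_mhc_kernels.py` or `SinkhornKnopp`; the function names, the iteration count, and the rows-then-columns order are assumptions made for illustration. Python's `math.exp` raises `OverflowError` where a tensor library would typically return `inf`, but the failure mode being illustrated is the same.

```python
import math


def sinkhorn_stable(logits, num_iters=20, eps=1e-6):
    """Push exp(logits) toward a doubly stochastic matrix (illustrative sketch)."""
    # Subtract each row's max before exponentiating, so the largest
    # exponent in every row is 0 and exp() cannot overflow.
    mat = []
    for row in logits:
        row_max = max(row)
        mat.append([math.exp(x - row_max) for x in row])

    n_cols = len(mat[0])
    for _ in range(num_iters):
        # Row normalization: each row sums to roughly 1.
        normalized = []
        for row in mat:
            row_sum = sum(row) + eps
            normalized.append([x / row_sum for x in row])
        mat = normalized
        # Column normalization: each column sums to roughly 1.
        col_sums = [sum(row[j] for row in mat) + eps for j in range(n_cols)]
        mat = [[x / col_sums[j] for j, x in enumerate(row)] for row in mat]
    return mat


def naive_exp(logits):
    """Bare exp() with no max subtraction, as in the flagged fallback path."""
    return [[math.exp(x) for x in row] for row in logits]


logits = [[1000.0, 0.0], [0.0, 1000.0]]

try:
    naive_exp(logits)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

# Pass eps explicitly so a changed default cannot alter behavior silently.
result = sinkhorn_stable(logits, eps=1e-6)
```

Because the row-max shift only rescales each row, the first row normalization in this sketch cancels it (exactly when `eps` is zero), so the stable variant computes the same result as the naive one wherever the naive one does not overflow.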
Also worth double-checking: the num-zeros golden values show large jumps at several steps (e.g., step 8: 1580→34401, step 14: 1777→34591) — roughly 20x increases. This pattern is intermittent (some steps are normal, some are ~34k). Worth confirming this is expected and not an artifact of the numerical stability issue above.
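One way to check for such jumps systematically is to compare old and new golden values step by step and flag large ratios. The sketch below uses only the two steps quoted above; the plain-dict layout and the 10x threshold are assumptions for illustration, not the format of the golden-values JSON file.

```python
def flag_jumps(old, new, ratio_threshold=10.0):
    """Return {step: new/old ratio} for steps whose value grew by the threshold or more."""
    flagged = {}
    for step in sorted(old.keys() & new.keys()):
        if old[step] <= 0:
            # Skip steps where a ratio is undefined or meaningless.
            continue
        ratio = new[step] / old[step]
        if ratio >= ratio_threshold:
            flagged[step] = ratio
    return flagged


# The two num-zeros data points mentioned in the review comment.
old_num_zeros = {8: 1580, 14: 1777}
new_num_zeros = {8: 34401, 14: 34591}

flagged = flag_jumps(old_num_zeros, new_num_zeros)
for step, ratio in flagged.items():
    print(f"step {step}: {ratio:.1f}x")
```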
/ok to test 0416b10
/claude review
/ok to test 578f5dd
/ok to test 296a892
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23527281274
… manifold hyper connection
Adds the core Manifold Hyper Connection (mHC) module and the supporting transformer-block / transformer-layer / config / recompute changes.
This is the first split of NVIDIA#3430 (mirror of upstream NVIDIA#2943 + kernel-fusion NVIDIA#3828), covering only files owned by core-adlr / core-nemo / transformer / cuda-graphs reviewers. GPT model wiring, pipeline-parallel support, fused mHC cuTile kernels, and the functional-test recipe are deferred to follow-up split PRs.
Original work by @jingqiny-99 in NVIDIA#3430 (upstream NVIDIA#2943, basic pytorch impl only; kernel fusion NVIDIA#3828 deferred).
Co-authored-by: jingqiny-99 <jingqiny@nvidia.com>
Co-authored-by: Dennis Liu <denliu@nvidia.com>
Fourth split of NVIDIA#3430 (mirror of upstream NVIDIA#2943 + kernel-fusion NVIDIA#3828).
Adds the cuTile-based fused implementation of the four mHC primitives (Sinkhorn-Knopp, H-aggregate, H_post * BDA, projection + RMS) plus the native PyTorch reference versions used as a fallback and for unit-test comparison.
Files:
- megatron/core/fusions/fused_mhc_kernels.py: new; cuTile kernel module, gated by cuda.tile import availability.
- megatron/core/transformer/hyper_connection.py: adds the `native_*` reference functions and the call-site dispatch to fused vs native based on `config.use_fused_mhc` and `is_cutile_available()`.
- megatron/core/transformer/transformer_config.py: adds `use_fused_mhc` and related kernel-config fields.
- tests/unit_tests/fusions/test_fused_mhc_kernels.py: new; forward + backward parity tests for each fused kernel against its native reference (skipped when cuTile is not installed).
Depends on NVIDIA#4531 (Split 1) for the underlying mHC module and on and transformer_config.py files in this PR pick up the merged-and-reviewed state of those files from PR NVIDIA#4469 (dsv4 branch); the diff against the current `mhc-pr1-core` therefore includes the kernel additions plus a few review-pass refinements that landed on dsv4 after PR NVIDIA#4531 was cut.
Reviewer groups touched: core-adlr, core-nemo, transformer.
Final remaining split:
- Split 5: functional-test recipe (`gpt3_mcore_te_tp2_pp2_mhc`)
Original work by @jingqiny-99 in NVIDIA#3430 (upstream kernel-fusion PR NVIDIA#3828).
Co-authored-by: jingqiny-99 <jingqiny@nvidia.com>
Co-authored-by: Dennis Liu <denliu@nvidia.com>
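The fused-vs-native dispatch described in this commit message could look roughly like the sketch below. Only the names `use_fused_mhc`, `is_cutile_available`, and the `cuda.tile` module come from the text above; the `Config` dataclass, `select_impl`, and the silent fallback are assumptions for illustration, and the real call site may warn or raise instead.

```python
import importlib.util
from dataclasses import dataclass


def is_cutile_available() -> bool:
    """Report whether the cuda.tile module can be located, without importing it."""
    try:
        return importlib.util.find_spec("cuda.tile") is not None
    except (ImportError, ValueError):
        # Raised when the parent `cuda` package is missing or malformed.
        return False


@dataclass
class Config:
    # Hypothetical stand-in for the transformer config field of the same name.
    use_fused_mhc: bool = False


def select_impl(config, native_fn, fused_fn, cutile_available=None):
    """Pick the fused kernel only when it is both requested and importable."""
    if cutile_available is None:
        cutile_available = is_cutile_available()
    if config.use_fused_mhc and cutile_available:
        return fused_fn
    # Assumed behavior: fall back to the native reference implementation.
    return native_fn
```

Passing `cutile_available` explicitly keeps the selection logic testable on machines without cuTile installed.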
Final split of NVIDIA#3430 (mirror of upstream NVIDIA#2943 + kernel-fusion NVIDIA#3828).
Adds the end-to-end functional test for the mHC feature:
Files:
- tests/functional_tests/test_cases/gpt/gpt3_mcore_te_tp2_pp2_mhc/
  - model_config.yaml: TP=2, PP=2 GPT-3 mHC training recipe
  - golden_values_dev_dgx_h100.json: golden metrics on dgx_h100
- tests/test_utils/recipes/h100/gpt.yaml: registers the new test case under the mr / mr-github scopes for dgx_h100.
Golden values picked up from NVIDIA#4469 (dsv4 branch) where the recipe has been calibrated against the final fused-kernel-on configuration. Reviewers can recalibrate if needed once Splits 1-4 land.
Depends on NVIDIA#4531 (Split 1), NVIDIA#4945 (Split 2), and NVIDIA#4947 (Split 4): the recipe enables enable_hyper_connections via the GPT spec and exercises the fused kernel path, so all three feature splits must land first.
Reviewer groups touched: ci.
Original work by @jingqiny-99 in NVIDIA#3430 (upstream NVIDIA#2943).
Co-authored-by: jingqiny-99 <jingqiny@nvidia.com>
Co-authored-by: Dennis Liu <denliu@nvidia.com>
… manifold hyper connection
Adds the core Manifold Hyper Connection (mHC) module and the supporting transformer-block / transformer-layer / config / recompute changes.
This is the first split of NVIDIA#3430 (mirror of upstream NVIDIA#2943 + kernel-fusion NVIDIA#3828), covering only files owned by core-adlr / core-nemo / transformer / cuda-graphs reviewers. GPT model wiring, pipeline-parallel support, fused mHC cuTile kernels, and the functional-test recipe are deferred to follow-up split PRs.
Original work by @jingqiny-99 in NVIDIA#3430 (upstream NVIDIA#2943, basic pytorch impl only; kernel fusion NVIDIA#3828 deferred).
Co-authored-by: jingqiny-99 <jingqiny@nvidia.com>
Co-authored-by: Dennis Liu <denliu@nvidia.com>
Signed-off-by: Yan Xu <yxu1@nvidia.com>
Fourth split of NVIDIA#3430 (mirror of upstream NVIDIA#2943 + kernel-fusion NVIDIA#3828).
Adds the cuTile-based fused implementation of the four mHC primitives (Sinkhorn-Knopp, H-aggregate, H_post * BDA, projection + RMS) plus the native PyTorch reference versions used as a fallback and for unit-test comparison.
Files:
- megatron/core/fusions/fused_mhc_kernels.py: new; cuTile kernel module, gated by cuda.tile import availability.
- megatron/core/transformer/hyper_connection.py: adds the `native_*` reference functions and the call-site dispatch to fused vs native based on `config.use_fused_mhc` and `is_cutile_available()`.
- megatron/core/transformer/transformer_config.py: adds `use_fused_mhc` and related kernel-config fields.
- tests/unit_tests/fusions/test_fused_mhc_kernels.py: new; forward + backward parity tests for each fused kernel against its native reference (skipped when cuTile is not installed).
Depends on NVIDIA#4531 (Split 1) for the underlying mHC module and on and transformer_config.py files in this PR pick up the merged-and-reviewed state of those files from PR NVIDIA#4469 (dsv4 branch); the diff against the current `mhc-pr1-core` therefore includes the kernel additions plus a few review-pass refinements that landed on dsv4 after PR NVIDIA#4531 was cut.
Reviewer groups touched: core-adlr, core-nemo, transformer.
Final remaining split:
- Split 5: functional-test recipe (`gpt3_mcore_te_tp2_pp2_mhc`)
Original work by @jingqiny-99 in NVIDIA#3430 (upstream kernel-fusion PR NVIDIA#3828).
Co-authored-by: jingqiny-99 <jingqiny@nvidia.com>
Co-authored-by: Dennis Liu <denliu@nvidia.com>
Signed-off-by: Yan Xu <yxu1@nvidia.com>
Final split of NVIDIA#3430 (mirror of upstream NVIDIA#2943 + kernel-fusion NVIDIA#3828).
Adds the end-to-end functional test for the mHC feature:
Files:
- tests/functional_tests/test_cases/gpt/gpt3_mcore_te_tp2_pp2_mhc/
  - model_config.yaml: TP=2, PP=2 GPT-3 mHC training recipe
  - golden_values_dev_dgx_h100.json: golden metrics on dgx_h100
- tests/test_utils/recipes/h100/gpt.yaml: registers the new test case under the mr / mr-github scopes for dgx_h100.
Golden values picked up from NVIDIA#4469 (dsv4 branch) where the recipe has been calibrated against the final fused-kernel-on configuration. Reviewers can recalibrate if needed once Splits 1-4 land.
Depends on NVIDIA#4531 (Split 1), NVIDIA#4945 (Split 2), and NVIDIA#4947 (Split 4): the recipe enables enable_hyper_connections via the GPT spec and exercises the fused kernel path, so all three feature splits must land first.
Reviewer groups touched: ci.
Original work by @jingqiny-99 in NVIDIA#3430 (upstream NVIDIA#2943).
Co-authored-by: jingqiny-99 <jingqiny@nvidia.com>
Co-authored-by: Dennis Liu <denliu@nvidia.com>
Signed-off-by: Yan Xu <yxu1@nvidia.com>
What does this PR do?
Kernel fusion (using cuTile) for mHC.
Verified with the 27B model (introduced in the original mHC paper) on GB200.
Throughput: (figure not preserved)
Validation LM Loss: (figure not preserved)
Iteration Time: (figure not preserved)
Kernel Performance
Config: s=8192, b=4, n=4, C=7168
Prerequisites
Linux x86_64, Linux aarch64 or Windows x86_64
A GPU with compute capability 8.x, 10.x, 11.x or 12.x
NVIDIA Driver r580 or later
Python version 3.10, 3.11, 3.12 or 3.13
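The version-related prerequisites above can be expressed as a small preflight check. This is a sketch under assumptions: the helper names are hypothetical, the compute capability and driver version are passed in by the caller rather than queried from the GPU, and the accepted sets simply mirror the list above (note that 9.x does not appear in it).

```python
import sys

# Values copied from the prerequisites list above.
SUPPORTED_PYTHON_MINORS = {10, 11, 12, 13}
SUPPORTED_CC_MAJORS = {8, 10, 11, 12}
MIN_DRIVER_BRANCH = 580


def python_supported(version_info=sys.version_info) -> bool:
    """True for Python 3.10 through 3.13."""
    return version_info[0] == 3 and version_info[1] in SUPPORTED_PYTHON_MINORS


def compute_capability_supported(major: int) -> bool:
    """True when the GPU's compute capability major version is in the listed set."""
    return major in SUPPORTED_CC_MAJORS


def driver_supported(driver_version: str) -> bool:
    """True when the driver branch (the part before the first dot) is r580 or later."""
    return int(driver_version.split(".")[0]) >= MIN_DRIVER_BRANCH
```

Such a check could run before enabling the fused path, so that an unsupported environment is reported up front instead of failing at kernel launch.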
Usage
Install cuTile:
pip install cuda-tile[tileiras]
Enable the fused kernels with:
--use-fused-mhc
Contribution process
Pre-checks
Code review
Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS.
Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change `megatron/core`, once all expert reviewers have approved, the `Final Review` label is applied automatically and final reviewers are assigned.
For PRs outside `megatron/core`, this step is skipped.
Step 3: Approved
Once all required reviewers have approved, the `Approved` label is applied automatically.
Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.
MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.