[Common] MXFP8 kernel for grouped tensors by Oleg-Goncharov · Pull Request #2586 · NVIDIA/TransformerEngine

Oleg-Goncharov · 2026-01-12T18:10:19Z

Description

This PR adds a new kernel that supports MXFP8 quantization of grouped tensors.

Below is a performance comparison of tensor-descriptor updates with O(log N) vs. O(N) complexity for varying numbers of descriptors (N = 2, 4, 8, …, 64). Each grouped input consists of N tensors of shape [256, 8192]. Measured on GB300.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Added MXFP8 cast kernel for grouped tensors
Added the test suite

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

for more information, see https://pre-commit.ci


greptile-apps · 2026-01-24T00:57:45Z

Greptile Overview

Greptile Summary

This PR implements MXFP8 quantization for grouped tensors, adding a new GPU kernel that uses TMA (Tensor Memory Accelerator) descriptors for efficient data transfer and O(log N) binary search for tensor identification in grouped tensor scenarios.

Key Changes:

Added group_quantize_mxfp8.cuh with the core MXFP8 grouped quantization kernel supporting rowwise/columnwise scaling
Implemented TMA descriptor update mechanism with O(log N) complexity for varying tensor shapes
Extended C API in cast.h with 7 new grouped tensor quantization functions
Added grouped variants for activation functions (GeLU, ReLU, SiLU, QGeLU, SReLU) with dbias support
Comprehensive test suite with reference implementation covering multiple shape representations

Issues from Previous Comments:
Previous review threads identified several concerns that appear to remain unresolved:

Binary search underflow risk when current_offset < offsets_ptr[0] (line 85)
Uninitialized shape_rep variable if none of the four shape conditions match (lines 755-764)
Commented-out code creating ambiguity in switch statement (line 104)
Typos: "gropued" instead of "grouped" in multiple locations in cast.h
Missing newline after conditional compilation block
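One way to rule out the uninitialized shape_rep concern is to derive the value from two flags so that every combination maps to exactly one enumerator. The sketch below is only an illustration: the enumerators VARYING_LAST_DIM and VARYING_BOTH_DIMS appear in this review, while SAME_BOTH_DIMS, VARYING_FIRST_DIM, and the helper select_shape_rep are assumed names, not taken from the PR.

```cpp
// Hypothetical mirror of the four shape cases. Only VARYING_LAST_DIM and
// VARYING_BOTH_DIMS are named in the review; the other two names are assumed.
enum class ShapeRepresentation {
  SAME_BOTH_DIMS,
  VARYING_FIRST_DIM,
  VARYING_LAST_DIM,
  VARYING_BOTH_DIMS
};

// Hypothetical helper: both flags are always consumed, so the function
// returns a defined value for all four (first, last) combinations.
inline ShapeRepresentation select_shape_rep(bool first_dim_varies, bool last_dim_varies) {
  if (first_dim_varies) {
    return last_dim_varies ? ShapeRepresentation::VARYING_BOTH_DIMS
                           : ShapeRepresentation::VARYING_FIRST_DIM;
  }
  return last_dim_varies ? ShapeRepresentation::VARYING_LAST_DIM
                         : ShapeRepresentation::SAME_BOTH_DIMS;
}
```

Because the result is returned rather than assigned in a chain of independent if statements, there is no path on which the caller sees an indeterminate value.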

Architecture:
The implementation uses a two-kernel approach: first updating TMA descriptors per tensor, then launching the main quantization kernel that uses binary search to identify which tensor each block processes. The kernel supports both single-tensor (constant last dimension) and multi-tensor cases with different optimization paths.
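The per-block tensor lookup described above can be sketched on the host as a guarded binary search. This is not the PR's get_current_tensor_id(); the function name find_tensor_id, the sentinel return value, and the assumption that the offsets array holds num_tensors + 1 monotonically non-decreasing entries are all illustrative. The explicit range check is one possible way to avoid the underflow flagged for current_offset < offsets_ptr[0].

```cpp
#include <cstddef>
#include <cstdint>

// Assumed layout: offsets[i] is the first element of tensor i and
// offsets[num_tensors] is the total element count.
// Returns i such that offsets[i] <= current_offset < offsets[i + 1],
// or num_tensors (a sentinel) when no tensor owns current_offset.
inline size_t find_tensor_id(const int64_t* offsets, size_t num_tensors,
                             int64_t current_offset) {
  if (num_tensors == 0 || current_offset < offsets[0] ||
      current_offset >= offsets[num_tensors]) {
    return num_tensors;  // out of range: report instead of underflowing
  }
  size_t lo = 0;
  size_t hi = num_tensors;
  // Invariant: offsets[lo] <= current_offset < offsets[hi].
  while (hi - lo > 1) {
    const size_t mid = lo + (hi - lo) / 2;
    if (offsets[mid] <= current_offset) {
      lo = mid;
    } else {
      hi = mid;
    }
  }
  return lo;
}
```

The loop halves the candidate range on every iteration, which gives the O(log N) lookup cost per block. With the invariant above, an empty tensor (two equal consecutive offsets) is skipped naturally.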

Confidence Score: 3/5

This PR requires addressing several logic issues before merging, particularly around variable initialization and edge case handling
Score reflects substantial new functionality with comprehensive tests, but multiple unresolved concerns from previous review including potential binary search underflow, uninitialized variables, and typos that need to be addressed
Primary attention needed on transformer_engine/common/cast/mxfp8/group_quantize_mxfp8.cuh for initialization and edge case issues, and transformer_engine/common/include/transformer_engine/cast.h for typo corrections

Important Files Changed

Filename	Overview
transformer_engine/common/cast/mxfp8/group_quantize_mxfp8.cuh	New MXFP8 grouped quantization kernel with TMA descriptors and binary search for tensor identification; previous comments identified potential issues with uninitialized variables and binary search underflow
tests/cpp/operator/test_cast_mxfp8_grouped.cu	Comprehensive test suite with reference implementation and multiple shape representations for grouped tensor quantization
transformer_engine/common/cast/dispatch/quantize.cuh	Added dispatcher helpers for grouped tensor quantization following existing pattern for regular tensors
transformer_engine/common/include/transformer_engine/cast.h	Added C API declarations for grouped tensor quantization; typos reported in previous comments

Sequence Diagram

sequenceDiagram
    participant User
    participant API as C API Layer<br/>(cast.cu)
    participant Dispatcher as Dispatch Layer<br/>(quantize.cuh)
    participant Kernel as MXFP8 Kernel<br/>(group_quantize_mxfp8.cuh)
    participant GPU as GPU Device

    User->>API: nvte_group_quantize(input, output, stream)
    API->>Dispatcher: group_quantize_fwd_helper()
    Dispatcher->>Dispatcher: Convert NVTEGroupedTensor to GroupedTensor*
    Dispatcher->>Dispatcher: Check scaling_mode == NVTE_MXFP8_1D_SCALING
    
    alt Multi-tensor case (VARYING_LAST_DIM or VARYING_BOTH_DIMS)
        Dispatcher->>Kernel: update_tma_descriptors<<<num_tensors, 32>>>()
        Kernel->>GPU: Launch descriptor update kernel
        loop For each tensor in group
            GPU->>GPU: modify_base_tensor_map()<br/>Update tensor map for each tensor's data pointer
        end
        GPU-->>Kernel: TMA descriptors updated
    end
    
    Dispatcher->>Kernel: group_quantize_mxfp8_kernel<<<blocks, 128>>>()
    Kernel->>GPU: Launch main quantization kernel
    
    loop For each block
        GPU->>GPU: get_current_tensor_id()<br/>Binary search to find tensor ID
        GPU->>GPU: Acquire TMA fence for tensor map
        GPU->>GPU: TMA load input data to shared memory
        
        alt COLWISE_SCALING
            GPU->>GPU: Compute column-wise AMAX
            GPU->>GPU: Generate E8M0 scale factor
            GPU->>GPU: Quantize to MXFP8 with column-wise scale
            GPU->>GPU: TMA store to global memory
        end
        
        alt ROWWISE_SCALING
            GPU->>GPU: Compute row-wise AMAX
            GPU->>GPU: Generate E8M0 scale factor
            GPU->>GPU: Quantize to MXFP8 with row-wise scale
            GPU->>GPU: TMA store to global memory
        end
    end
    
    GPU-->>Dispatcher: Quantization complete
    Dispatcher-->>API: Return
    API-->>User: Return
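The "Generate E8M0 scale factor" step in the diagram can be illustrated with a small host-side sketch. It assumes an E4M3 output type (largest finite value 448) and a round-up policy, i.e. the smallest power-of-two scale for which amax / scale still fits in E4M3. The rounding and NaN handling used by the actual kernel may differ; the names e8m0_from_amax and e8m0_to_float are hypothetical.

```cpp
#include <cmath>
#include <cstdint>

constexpr double kFp8E4M3Max = 448.0;  // largest finite E4M3 value
constexpr int kE8M0Bias = 127;         // E8M0 stores exponent + 127

// Hypothetical helper: biased E8M0 exponent for a block with absolute
// maximum `amax`, rounded up so that amax / 2^exponent <= 448.
inline uint8_t e8m0_from_amax(float amax) {
  if (!(amax > 0.0f)) return 0;      // zero (or NaN) block: smallest scale, an assumption
  if (std::isinf(amax)) return 254;  // clamp to the largest finite E8M0 code
  int e = 0;
  // frexp gives ratio = m * 2^e with m in [0.5, 1), so ceil(log2(ratio))
  // is e, except for exact powers of two (m == 0.5) where it is e - 1.
  const double m = std::frexp(static_cast<double>(amax) / kFp8E4M3Max, &e);
  int biased = ((m == 0.5) ? e - 1 : e) + kE8M0Bias;
  if (biased < 0) biased = 0;
  if (biased > 254) biased = 254;    // 255 is reserved for NaN in E8M0
  return static_cast<uint8_t>(biased);
}

// Decodes a biased E8M0 exponent back to its power-of-two scale.
inline float e8m0_to_float(uint8_t biased) {
  return std::ldexp(1.0f, static_cast<int>(biased) - kE8M0Bias);
}
```

Each value in the block would then be divided by e8m0_to_float(scale) before conversion to FP8. Using frexp instead of log2 keeps the boundary cases (amax exactly 448 times a power of two) exact.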

greptile-apps

10 files reviewed, 6 comments


ptrendx · 2026-02-03T01:30:28Z

+    const __grid_constant__ CUtensorMap tensor_map_act_input_static,
+    const __grid_constant__ CUtensorMap tensor_map_output_rowwise_static,
+    const __grid_constant__ CUtensorMap tensor_map_output_colwise_static,
+    const ShapeRepresentation shape_rep, const size_t num_tensors, const size_t first_logical_dim,


Doesn't passing shape_rep as a regular parameter impact performance?

I haven’t measured the performance impact, but it should be very small since it’s only used during initialization and isn’t on the critical path.

ptrendx · 2026-02-03T01:36:31Z

+  NVTE_CHECK(last_logical_dim % 128 == 0,
+             "Last dimension of a grouped tensor should be divisible by 128.");


Do we need that? I think we only need that if we want columnwise scaling, no?

I initially assumed a full 128×128 tile input, but we can relax this restriction for a single-tensor view with a simple change. The input/output alignment is validated when the tensor descriptor is created. However, we need special care when the last dimension varies across inputs (i.e., when it can’t be viewed as a single tensor). In that case, we should validate alignment when updating the tensor descriptors in the helper kernel and raise an error if the data is not aligned.
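For the varying-last-dimension case, the per-tensor validation could look roughly like the host-side sketch below. The default multiple of 32 follows the later commit that relaxed the requirement from mod 128 to mod 32; the function name first_misaligned_tensor and the choice to return an index rather than raise an error are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the index of the first tensor whose last
// dimension is not a multiple of `required_multiple`, or last_dims.size()
// when every tensor passes.
inline size_t first_misaligned_tensor(const std::vector<size_t>& last_dims,
                                      size_t required_multiple = 32) {
  for (size_t i = 0; i < last_dims.size(); ++i) {
    if (last_dims[i] % required_multiple != 0) {
      return i;  // caller can report which tensor violated the constraint
    }
  }
  return last_dims.size();
}
```

Returning the offending index lets the caller produce an error message that names the tensor, which is more useful for grouped inputs than a single pass/fail flag.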


vthumbe1503 · 2026-02-05T21:00:55Z

/te-ci L1 pytorch

vthumbe1503 · 2026-02-05T21:10:37Z

@Oleg-Goncharov, I have tested grouped_quantize through a PyTorch binding created for nvte_grouped_quantize, and it works fine for all four cases of (first_dims, last_dims). The changes in the PR look OK to me based on what I could understand. Could we merge this, @Oleg-Goncharov @ptrendx?

vthumbe1503 · 2026-02-06T00:45:35Z

/te-ci pytorch

vthumbe1503 · 2026-02-06T01:25:05Z

/te-ci pytorch


Oleg-Goncharov · 2026-02-06T18:08:16Z

Thank you for checking it, @vthumbe1503. I’m working on extending dbias support to output a split grouped tensor, since the kernel currently accumulates dbias into a single tensor. Let’s merge this once that’s in.
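The difference between the two dbias layouts can be sketched with a host-side reference. It assumes a constant last dimension cols, tensors stacked row-wise in one row-major buffer, and a row_offsets array with num_tensors + 1 entries; the function names dbias_single and dbias_split are hypothetical and do not come from the PR.

```cpp
#include <cstddef>
#include <vector>

// Single dbias: column sums over all rows of the grouped gradient.
inline std::vector<float> dbias_single(const std::vector<float>& grad, size_t cols) {
  if (cols == 0) return {};
  std::vector<float> out(cols, 0.0f);
  for (size_t i = 0; i < grad.size(); ++i) {
    out[i % cols] += grad[i];
  }
  return out;
}

// Split dbias: one row of column sums per tensor, laid out as [num_tensors, cols].
// Tensor t owns rows [row_offsets[t], row_offsets[t + 1]).
inline std::vector<float> dbias_split(const std::vector<float>& grad, size_t cols,
                                      const std::vector<size_t>& row_offsets) {
  if (cols == 0 || row_offsets.size() < 2) return {};
  const size_t num_tensors = row_offsets.size() - 1;
  std::vector<float> out(num_tensors * cols, 0.0f);
  for (size_t t = 0; t < num_tensors; ++t) {
    for (size_t r = row_offsets[t]; r < row_offsets[t + 1]; ++r) {
      for (size_t c = 0; c < cols; ++c) {
        out[t * cols + c] += grad[r * cols + c];
      }
    }
  }
  return out;
}
```

Summing the rows of the split result reproduces the single result, so the single-tensor output can serve as a consistency check for a split implementation.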

ptrendx · 2026-02-06T19:46:47Z

@Oleg-Goncharov Please open a new PR with the proper dbias support - let's try to minimize the review effort.

yaox12 mentioned this pull request Jan 14, 2026

MoE training optimization #2438

Open

Oleg-Goncharov force-pushed the pr_mxfp8_grouped_kernel branch 4 times, most recently from e6bf02a to fc2a53f Compare January 15, 2026 16:15

Oleg-Goncharov added the enhancement and MoE labels Jan 15, 2026

ptrendx linked an issue Jan 16, 2026 that may be closed by this pull request

Quantization support for GroupedTensor: MXFP8 #2450

Closed

Rebased to main

88cf1b2

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

Oleg-Goncharov force-pushed the pr_mxfp8_grouped_kernel branch from 74a7917 to 88cf1b2 Compare January 21, 2026 17:00

pre-commit-ci Bot and others added 6 commits January 21, 2026 17:00

[pre-commit.ci] auto fixes from pre-commit.com hooks

ac23f06

for more information, see https://pre-commit.ci

Merge branch 'main' into pr_mxfp8_grouped_kernel

44ec5ba

Fixed the year to 2026

99f1f63

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

Added compilation guards

7415138

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

adacda9

for more information, see https://pre-commit.ci

Added BWD pass

39bb24f

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

Oleg-Goncharov force-pushed the pr_mxfp8_grouped_kernel branch from 7c4fda7 to 39bb24f Compare January 22, 2026 18:12

Oleg-Goncharov and others added 2 commits January 22, 2026 19:13

Merge branch 'main' into pr_mxfp8_grouped_kernel

02c05a6

[pre-commit.ci] auto fixes from pre-commit.com hooks

452651a

for more information, see https://pre-commit.ci

Oleg-Goncharov requested a review from ptrendx January 22, 2026 22:19

vthumbe1503 and others added 4 commits January 23, 2026 02:49

Merge branch 'main' into pr_mxfp8_grouped_kernel

9da18bf

Added dbias and dact tests. Refactoring.

e8beb1e

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

b3f8468

for more information, see https://pre-commit.ci

Added grouped MXFP8 DACT and ACT API and tests

1235167

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

Oleg-Goncharov marked this pull request as ready for review January 24, 2026 00:53

[pre-commit.ci] auto fixes from pre-commit.com hooks

34b9dfd

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed Jan 24, 2026


Fixed a typo

6dd3814

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

greptile-apps Bot reviewed Jan 24, 2026


Comment thread transformer_engine/common/cast/mxfp8/group_quantize_mxfp8.cuh Outdated

ptrendx reviewed Feb 2, 2026


Comment thread transformer_engine/common/cast/mxfp8/group_quantize_mxfp8.cuh Outdated

ptrendx reviewed Feb 3, 2026


Comment thread transformer_engine/common/cast/mxfp8/group_quantize_mxfp8.cuh

ptrendx reviewed Feb 3, 2026


Comment thread transformer_engine/common/cast/mxfp8/group_quantize_mxfp8.cuh

ptrendx reviewed Feb 3, 2026


Merge branch 'main' into pr_mxfp8_grouped_kernel

3645aab

greptile-apps Bot reviewed Feb 3, 2026


Fixes per the review

425c720

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

greptile-apps Bot reviewed Feb 3, 2026


Relaxed requirement for last dim from mod128 to mod32

e9ddde1

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

Oleg-Goncharov force-pushed the pr_mxfp8_grouped_kernel branch from d2621c4 to e9ddde1 Compare February 4, 2026 11:44

[pre-commit.ci] auto fixes from pre-commit.com hooks

273f4c5

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed Feb 4, 2026


Merge branch 'main' into pr_mxfp8_grouped_kernel

6f54d5c

greptile-apps Bot reviewed Feb 4, 2026


Fix

c867943

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

greptile-apps Bot reviewed Feb 4, 2026


Oleg-Goncharov and others added 2 commits February 4, 2026 14:08

Added alignment checks when tensor descriptors are modified

0883e5a

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

132b1c0

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed Feb 4, 2026


ptrendx approved these changes Feb 6, 2026


Merge branch 'main' into pr_mxfp8_grouped_kernel

d68a427

greptile-apps Bot reviewed Feb 6, 2026


ptrendx merged commit 7393947 into NVIDIA:main Feb 6, 2026
21 of 24 checks passed
