[mxfp8 moe training] add CUDA kernel to quantize 3d tensor colwise by danielvegamyhre · Pull Request #3002 · pytorch/ao

danielvegamyhre · 2025-09-14T23:28:34Z

Stacked PRs:

[mxfp8 moe training] add CUDA kernel to quantize 3d tensor colwise

Summary

This PR adds a new CUDA kernel specifically for quantizing 3d expert weights of shape (E,N,K) along the N dimension and writing the result directly in column-major format.
- Design: I create separate input/output TMA descriptors for each expert and process each 2d expert in parallel, using the same method as the 2d dim1 quantization kernel. The 2d kernel achieves 85% of peak memory bandwidth, so hopefully we can achieve similar perf for 3d.
The existing methods for quantizing 3d expert weights both scale very poorly. I have verified this via benchmarking and traces (see previous PR), and hypothesize that it is due to the required .contiguous() calls:
- to_mx only scales along the last dim and requires contiguous inputs. So scaling along the N dim (needed for backward) requires transposing the contiguous tensor (E,N,K) -> (E,K,N) and then calling .contiguous().
- The existing CUDA kernel for casting along dim1 can be used by treating the 3d input tensor as a 2d tensor of shape (E*N, K). However, this produces a 2d output tensor in column-major format, and there is no way to reshape and restride that tensor to be 3d again AND preserve the column-major format such that numerics are preserved. Thus, we have to transform the output to per-expert column-major format afterwards, requiring a .contiguous() call.
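To make the layout mismatch in the second bullet concrete, here is a minimal pure-Python sketch (no torch; the helper names are hypothetical and only for illustration). It compares the memory offset of element (e, n, k) in a column-major (E*N, K) matrix against the offset in the per-expert column-major layout, i.e. shape (E, N, K) with strides (N*K, 1, N). This is my reading of why the 2d output cannot simply be viewed as the 3d layout we want.

```python
from itertools import product


def offset_2d_colmajor(e, n, k, E, N, K):
    # (E*N, K) matrix stored column-major: row index is e*N + n,
    # strides are (1, E*N).
    return k * (E * N) + (e * N + n)


def offset_per_expert_colmajor(e, n, k, E, N, K):
    # (E, N, K) tensor with strides (N*K, 1, N): each expert is a
    # contiguous column-major (N, K) matrix.
    return e * (N * K) + k * N + n


def count_mismatches(E, N, K):
    # Number of elements that land at different offsets in the two layouts.
    return sum(
        offset_2d_colmajor(e, n, k, E, N, K)
        != offset_per_expert_colmajor(e, n, k, E, N, K)
        for e, n, k in product(range(E), range(N), range(K))
    )


print(count_mismatches(1, 4, 3))  # single expert: layouts coincide
print(count_mismatches(2, 4, 3))  # multiple experts: layouts diverge
```

With a single expert the two layouts are identical, which is consistent with the 2d kernel being usable as-is for E=1. For E>1 the 2d column-major output interleaves experts within each column, so a copy is needed to get each expert contiguous.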

Test plan

Added tests that verify numerical accuracy

Kernel microbenchmarks

Perf is decent for large E and abysmal for small E. Need to investigate this.

Update (9/15): NCU shows 3d kernel operating on (2,8192,5120) tensor is actually compute bound (??)

input_shape         to_mx_us    cuda_2d_us    cuda_3d_us    to_mx_gbps    cuda_2d_gbps    cuda_3d_gbps
----------------  ----------  ------------  ------------  ------------  --------------  --------------
(1, 8192, 5120)      117.92         69.776       242.112      1078.19          1822.11         525.128
(2, 8192, 5120)      431.264       105.536       249.728       589.615         2409.41        1018.23
(4, 8192, 5120)      848.992       195.584       297.376       599.015         2600.21        1710.16
(8, 8192, 5120)     1682.59        379.904       412.992       604.495         2677.3         2462.8
(16, 8192, 5120)    3350.53        775.984       615.536       607.139         2621.49        3304.82
(64, 8192, 5120)   13352          3150.66       1959.71        609.418         2582.62        4152.11
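
For reference, the GB/s columns appear to be derived from the timings by counting bytes moved. The benchmark script is not shown here, so the byte accounting below is an assumption on my part: 2 bytes per input element (bf16), 1 byte per fp8 output element, and one 1-byte scale per 32 elements. Under that assumption the sketch reproduces the table values for the rows I checked; `effective_gbps` is a hypothetical helper name.

```python
def effective_gbps(E, N, K, time_us, block_size=32):
    """Effective memory bandwidth in GB/s for one quantization call.

    Assumed byte accounting (not confirmed by the benchmark source):
    bf16 input read + fp8 output written + 1-byte scales written.
    """
    elems = E * N * K
    bytes_read = elems * 2              # bf16 input
    bytes_written = elems * 1           # fp8 output
    scale_bytes = elems // block_size   # one scale per 32-element block
    total_bytes = bytes_read + bytes_written + scale_bytes
    return total_bytes / (time_us * 1e-6) / 1e9


# Rows taken from the table above.
print(round(effective_gbps(1, 8192, 5120, 117.92), 2))   # to_mx, E=1
print(round(effective_gbps(1, 8192, 5120, 242.112), 2))  # cuda_3d, E=1
```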

pytorch-bot · 2025-09-14T23:28:38Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3002

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b64758e with merge base f75b251:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

stack-info: PR: #3002, branch: danielvegamyhre/stack/69

danielvegamyhre · 2025-09-15T18:18:36Z

@slayton58 @ngimel i would be curious to get your thoughts on ways to improve this kernel for quantizing 3d expert weights (E,N,K) along the N dim, where weights are contiguous. It uses nearly identical logic to the 2d dim1 cast kernel (which achieves ~85% mem bw utilization), yet the perf is much worse (~8% to 62% peak mem bw, depending on input size - see benchmarks in PR description).

I think the culprit might be how I'm allocating all the TMA descriptors and passing them in; the overhead might be too much for small E. NCU has not flagged anything particularly helpful so far. Strangely, for E=2 it shows the kernel as compute bound, with 78% compute throughput and 38% memory bandwidth utilization.

Additional context: torch.compile and handwritten Triton kernels were both slow for mxfp8 quant of RHS operands where we scale colwise (32x1 granularity); e.g., Triton hit 56% of peak mem bw. So I added a CUDA kernel, derived from a TE kernel, which achieves ~85% of peak mem bw (#2513). Basically we stripped out internal TE types, added support for different scale calculation modes (floor, rceil) to align with torchao numerics, then resolved some perf issues resulting from those changes to get reasonable perf.

Now, I'm finding that quantizing 3d expert weights along dim1 scales extremely poorly as the number of experts increases (see this PR's description for details, and see #2999 for benchmarks). So I added a similar CUDA kernel to our mxfp8_cuda extension specifically for quantizing 3d expert weights colwise and writing directly to the col-major format we need.

The first approach I tried was just updating the 2d kernel to handle 3d tensors by treating the input as a 2d tensor of shape (E*N, K), but the coordinate mapping / pointer arithmetic became a complicated mess that wasn't working. So I made a new kernel that is similar to the 2d kernel but passes in separate input/output TMA descriptors for each expert; the kernel then operates on each 2d expert with logical separation, in parallel.


ngimel · 2025-09-19T02:27:36Z

Discussed offline, likely culprit is cudaMallocManaged calls on the hot path https://github.com/pytorch/ao/pull/3002/files#diff-7ddc6623d9efea4ee4f4bdb3cdd7ef16ec3d3bc8bc974be85311125e464efb3dR1326, we should be able to create just a single descriptor and do the loads using the correct offsets.

danielvegamyhre · 2025-09-19T03:09:59Z

> Discussed offline, likely culprit is cudaMallocManaged calls on the hot path https://github.com/pytorch/ao/pull/3002/files#diff-7ddc6623d9efea4ee4f4bdb3cdd7ef16ec3d3bc8bc974be85311125e464efb3dR1326, we should be able to create just a single descriptor and do the loads using the correct offsets.

I've been trying this today. Progress so far:

- Scales for all experts are correct
- Quantized data is correct for the expert=0 subtensor, but is all 0s for all other experts

This indicates the new input TMA descriptor and async loads are working properly, but the output TMA descriptor and/or async stores are not.

I think this is because the input tensor is in simple row-major format, which can easily be represented by a 2d TMA descriptor with shape (E*N, K) and row stride K. However, the output data needs to be in "column major PER expert" format, i.e. shape (E, N, K) with strides (N*K, 1, N). The 2d output TMA descriptor + ptx::cp_async_bulk_tensor_2d_global_to_shared do not seem capable of representing this layout so far (or it could be a skill issue on my part, haha).
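
As a sanity check on the input side, here is a small pure-Python sketch (hypothetical helper names, offsets only, no TMA involved) showing why a single 2d row-major view of shape (E*N, K) addresses the same memory as the contiguous (E, N, K) input: expert e simply starts at row e*N.

```python
from itertools import product


def offset_3d_row_major(e, n, k, N, K):
    # Contiguous (E, N, K) tensor: strides (N*K, K, 1).
    return e * N * K + n * K + k


def offset_2d_flattened(row, col, K):
    # Row-major (E*N, K) matrix: strides (K, 1).
    return row * K + col


def flattened_view_matches(E, N, K):
    # Element (e, n, k) should live at row e*N + n, column k of the 2d view.
    return all(
        offset_3d_row_major(e, n, k, N, K)
        == offset_2d_flattened(e * N + n, k, K)
        for e, n, k in product(range(E), range(N), range(K))
    )


print(flattened_view_matches(3, 4, 5))
```

This only covers the loads. The output layout with strides (N*K, 1, N) does not flatten the same way along (E*N, K), which matches the symptom of only expert 0 being written correctly.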

I will push a WIP PR on top of this stack to show the difference between this multiple-tma-descriptor approach, which is at least functionally correct, versus the single tma descriptor approach.

slayton58 · 2025-09-19T13:50:41Z

@danielvegamyhre I think there's a couple of options, can we do something like:

# out: [E, N, K], stride: [N*K, 1, N]
out_t = out.transpose(2, 1)  # now [E, K, N], stride: [N*K, N, 1]; representable by a row-major TMA descriptor (2d or otherwise)
# compute, writing into out_t
out = out_t.transpose(1, 2)  # back to the original form

or, we can try using 3d TMAs directly - ptx::cp_async_bulk_tensor_3d_global_to_shared is the relevant read invocation, and a 3d descriptor would have to be created for this.
or (finally) we can ignore TMA writes, and use regular global stores (so STG) - we shouldn't need the async, pipeline-able part of the TMA instruction.
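
A quick way to convince yourself of the first option is to track shapes and strides by hand. The sketch below is pure Python with hypothetical helper names (it does not call torch or any TMA API): it swaps the last two dims of the per-expert column-major layout and checks that the result is plain row-major, so it can also be flattened to a 2d (E*K, N) row-major matrix.

```python
def row_major_strides(shape):
    # Strides (in elements) of a contiguous row-major tensor of this shape.
    strides, acc = [], 1
    for dim in reversed(shape):
        strides.append(acc)
        acc *= dim
    return tuple(reversed(strides))


def transpose_last_two(shape, strides):
    # A transpose is a view: swap the last two dims and their strides.
    return (shape[0], shape[2], shape[1]), (strides[0], strides[2], strides[1])


E, N, K = 4, 8, 6
out_shape, out_strides = (E, N, K), (N * K, 1, N)   # per-expert column major

t_shape, t_strides = transpose_last_two(out_shape, out_strides)
is_row_major = t_strides == row_major_strides(t_shape)

# Because the transposed view is contiguous, it flattens to 2d row-major.
flat_shape = (E * K, N)
flat_strides = row_major_strides(flat_shape)

print(t_shape, t_strides, is_row_major, flat_shape, flat_strides)
```

Since the transposed view is contiguous, a kernel that writes row-major (E*K, N) output is writing exactly the bytes of the per-expert column-major (E, N, K) tensor, and the final transpose back is free.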

danielvegamyhre · 2025-09-19T15:04:43Z

@slayton58 row major -> transpose to per expert col major is a great idea! Trying it now.

> or, we can try using 3d TMAs directly

Yeah I've considered this, it is probably the "proper" way to do it but would require a larger refactor, hopefully the transpose method works.

drisspg · 2025-09-19T20:42:40Z

Can we just merge the top of the stack w/ this PR?

drisspg · 2025-09-19T20:43:50Z

+          size_t SCALE_DIM_X, ScaleCalculationMode ScalingMode>
+__global__ void __launch_bounds__(MXFP8_THREADS_PER_CHUNK)
+    mxfp8_quantize_kernel_3d(
+        const CUtensorMap* tensor_maps_input,


Do we need to link against the cuda drivers in order to build this?

I think we fixed this at the top of the stack?

do you mean in the create tensor map functions? the other examples I saw do reference the cuda driver to get the cuTensorMapEncodeTiled function pointer, so I did the same for 3d tensor map creation.

danielvegamyhre added a commit that referenced this pull request Sep 14, 2025

[mxfp8 moe training] add CUDA kernel to quantize 3d tensor colwise

146b42a

stack-info: PR: #3002, branch: danielvegamyhre/stack/69

danielvegamyhre force-pushed the danielvegamyhre/stack/69 branch from 2b1b340 to 146b42a Compare September 14, 2025 23:28

This was referenced Sep 14, 2025

[mxfp8 moe training] add compile support #2990

Merged

[mxfp8 moe training] use dim1 cast cuda kernel for 3d weights by reshaping to 2d #2998

Merged

[moe training] add benchmarks for dsv3 236b, 671b shapes; reorganize benchmarks dir #2999

Merged

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 14, 2025

danielvegamyhre added mx moe module: not user facing Use this tag if you don't want this PR to show up in release notes labels Sep 14, 2025

danielvegamyhre marked this pull request as draft September 14, 2025 23:49

danielvegamyhre changed the base branch from danielvegamyhre/stack/68 to main September 14, 2025 23:51

danielvegamyhre added a commit that referenced this pull request Sep 14, 2025

[mxfp8 moe training] add CUDA kernel to quantize 3d tensor colwise

9921d5e

stack-info: PR: #3002, branch: danielvegamyhre/stack/69

danielvegamyhre force-pushed the danielvegamyhre/stack/69 branch from 146b42a to 9921d5e Compare September 14, 2025 23:51

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/68 September 14, 2025 23:51

danielvegamyhre changed the base branch from danielvegamyhre/stack/68 to main September 15, 2025 00:16

danielvegamyhre added a commit that referenced this pull request Sep 15, 2025

[mxfp8 moe training] add CUDA kernel to quantize 3d tensor colwise

b3b709c

stack-info: PR: #3002, branch: danielvegamyhre/stack/69

danielvegamyhre force-pushed the danielvegamyhre/stack/69 branch from 9921d5e to b3b709c Compare September 15, 2025 00:16

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/68 September 15, 2025 00:16

danielvegamyhre marked this pull request as ready for review September 15, 2025 00:17

danielvegamyhre changed the base branch from danielvegamyhre/stack/68 to main September 15, 2025 02:18

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/68 September 15, 2025 02:18

danielvegamyhre mentioned this pull request Sep 15, 2025

[mxfp8 moe training] wrap 3d quantize tensor in custom ops and integrate it #3003

Closed

danielvegamyhre changed the base branch from danielvegamyhre/stack/68 to main September 15, 2025 02:19

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/68 September 15, 2025 02:19

danielvegamyhre mentioned this pull request Sep 15, 2025

[mxfp8 moe training] wrap 3d quantize tensor in custom ops and integrate it #3004

Merged

danielvegamyhre changed the base branch from danielvegamyhre/stack/68 to main September 15, 2025 20:38

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/68 September 15, 2025 20:38

danielvegamyhre changed the base branch from danielvegamyhre/stack/68 to main September 15, 2025 21:02

danielvegamyhre force-pushed the danielvegamyhre/stack/69 branch from 030f4f3 to 4ebcfec Compare September 17, 2025 15:28

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/68 September 17, 2025 15:28

danielvegamyhre changed the base branch from danielvegamyhre/stack/68 to main September 17, 2025 15:47

danielvegamyhre force-pushed the danielvegamyhre/stack/69 branch from 4ebcfec to 6403a25 Compare September 17, 2025 15:47

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/68 September 17, 2025 15:48

danielvegamyhre force-pushed the danielvegamyhre/stack/68 branch from 213a554 to fefb1e0 Compare September 17, 2025 16:14

danielvegamyhre added a commit that referenced this pull request Sep 17, 2025

[mxfp8 moe training] add CUDA kernel to quantize 3d tensor colwise

a60ee11

stack-info: PR: #3002, branch: danielvegamyhre/stack/69

danielvegamyhre force-pushed the danielvegamyhre/stack/69 branch from 6403a25 to a60ee11 Compare September 17, 2025 16:14

danielvegamyhre changed the base branch from danielvegamyhre/stack/68 to main September 17, 2025 16:19

danielvegamyhre force-pushed the danielvegamyhre/stack/69 branch from a60ee11 to 6593572 Compare September 17, 2025 16:19

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/68 September 17, 2025 16:19

danielvegamyhre force-pushed the danielvegamyhre/stack/68 branch from 6dd01fc to 644d635 Compare September 17, 2025 16:24

danielvegamyhre added a commit that referenced this pull request Sep 17, 2025

[mxfp8 moe training] add CUDA kernel to quantize 3d tensor colwise

367b67c

stack-info: PR: #3002, branch: danielvegamyhre/stack/69

danielvegamyhre force-pushed the danielvegamyhre/stack/69 branch from 6593572 to 367b67c Compare September 17, 2025 16:24

danielvegamyhre added a commit that referenced this pull request Sep 17, 2025

[mxfp8 moe training] add CUDA kernel to quantize 3d tensor colwise

bb8b07f

stack-info: PR: #3002, branch: danielvegamyhre/stack/69

danielvegamyhre force-pushed the danielvegamyhre/stack/69 branch from 367b67c to bb8b07f Compare September 17, 2025 16:25

danielvegamyhre changed the base branch from danielvegamyhre/stack/68 to main September 17, 2025 16:25

[mxfp8 moe training] add CUDA kernel to quantize 3d tensor colwise

b64758e

stack-info: PR: #3002, branch: danielvegamyhre/stack/69

danielvegamyhre force-pushed the danielvegamyhre/stack/69 branch from bb8b07f to b64758e Compare September 19, 2025 03:35

This was referenced Sep 19, 2025

[mxfp8 moe training] remove mxfp8_gemms.py #3033

Merged

[mxfp8 moe training] update 3d quant colwise scaling kernel to use single input/output TMA descriptors #3034

Merged

drisspg reviewed Sep 19, 2025

View reviewed changes

drisspg approved these changes Sep 19, 2025

View reviewed changes

danielvegamyhre merged commit f210443 into main Sep 19, 2025
18 checks passed

danielvegamyhre mentioned this pull request Sep 20, 2025

[mxfp8 moe training] use new 3d colwise quantization kernel #3037

Merged
