[mxfp8 moe training] update 3d quant colwise scaling kernel to use single input/output TMA descriptors by danielvegamyhre · Pull Request #3034 · pytorch/ao

danielvegamyhre · 2025-09-19T03:35:50Z

Stacked PRs:

->[mxfp8 moe training] update 3d quant colwise scaling kernel to use single input/output TMA descriptors #3034


Summary

The CUDA kernel for 3d column-wise quantization added in [mxfp8 moe training] wrap 3d quantize tensor in custom ops and integrate it #3004 performs worse than the other methods for small num_experts, and is only better for large num_experts.
We hypothesize this is because it calls cudaMallocManaged to allocate a separate TMA descriptor per expert, and cudaMallocManaged is a slow, blocking call. The overhead is constant, so it is more noticeable for small inputs.
In this PR, I redesign the kernel to use a single input TMA descriptor and a single output TMA descriptor for the whole 3d tensor.
- For the input, it is in simple row major format, so I can read from specific experts by adjusting the TMA row offset during the async TMA load.
- For the output, it is a more complex "column major per expert" format, so I use a 3d TMA descriptor with the specific shape and strides needed. I transpose the row major data in SMEM before doing the async TMA store to GMEM to get it in col major per expert format.
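To make the two layouts concrete, here is a minimal host-side sketch of the index math only, not the kernel or the actual descriptor setup. It assumes the input has shape (E, N, K) stored row major, and that "column major per expert" means each expert's (N, K) slice is contiguous with N as the fastest-moving dimension. `View3D`, `input_row_offset`, `output_view`, and `output_offset` are hypothetical names introduced for illustration.

```cpp
#include <array>
#include <cstdint>

// Hypothetical description of a 3D view: dims and byte strides are listed
// innermost-first, with the innermost stride implied by the element size.
struct View3D {
  std::array<std::uint64_t, 3> dims;          // {fastest, middle, slowest}
  std::array<std::uint64_t, 2> byte_strides;  // strides of dims[1] and dims[2]
};

// Input: row-major (E, N, K) seen as one 2D (E*N, K) matrix, so expert e
// starts at row e * N. Only the row coordinate of the load changes.
inline std::uint64_t input_row_offset(std::uint64_t e, std::uint64_t N) {
  return e * N;
}

// Output: per-expert column major. N is the fastest dimension, then K,
// then E (assumption: this is what the 3D shape and strides encode).
inline View3D output_view(std::uint64_t E, std::uint64_t N, std::uint64_t K,
                          std::uint64_t elem_bytes) {
  return View3D{{N, K, E}, {N * elem_bytes, N * K * elem_bytes}};
}

// Element offset of (e, n, k) in the per-expert column-major output.
inline std::uint64_t output_offset(std::uint64_t e, std::uint64_t n,
                                   std::uint64_t k, std::uint64_t N,
                                   std::uint64_t K) {
  return e * N * K + k * N + n;
}
```

Under these assumptions, a single 2D view of the input plus a row offset is enough to reach any expert, while the output needs the third dimension so that the expert stride (N * K elements) can differ from the column stride (N elements).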

Test plan

Add unit tests for Llama4 and DeepSeekV3 shapes
sanitize pytest test/prototype/moe_training/test_kernels.py

Performance

input_shape         to_mx_us    cuda_2d_us    cuda_3d_us    to_mx_gbps    cuda_2d_gbps    cuda_3d_gbps
----------------  ----------  ------------  ------------  ------------  --------------  --------------
(1, 8192, 5120)      118.656        68.016        34.848      1071.5           1869.26         3648.41
(2, 8192, 5120)      430.064       105.472        61.44        591.26          2410.87         4138.67
(4, 8192, 5120)      847.824       197.632       118.784       599.841         2573.26         4281.38
(8, 8192, 5120)     1693.76        378.848       252.832       600.509         2684.77         4022.9
(16, 8192, 5120)    3449.66        742.368       489.44        589.691         2740.2          4156.26
(64, 8192, 5120)   13354          3145.7        1814.62        609.328         2586.69         4484.1


ngimel · 2025-09-19T22:30:26Z

What are cuda_2d numbers in the benchmark? Running in a loop for each expert?

danielvegamyhre · 2025-09-19T22:50:32Z

What are cuda_2d numbers in the benchmark? Running in a loop for each expert?

The benchmarks are from this script.

The "cuda_2d" benchmarks reference this function, which applies the 2d CUDA colwise quantization kernel to the 3d tensor: it reshapes the input in PyTorch from (E, N, K) to (E*N, K), quantizes it, then reshapes the output and scales back to 3d to match the expectations of the torch._scaled_grouped_mm mxfp8 2d-3d grouped gemm.

The key issue with this method is that the quantized (E*N, K) tensor is in column major format, and I couldn't find a way to reshape/view it back to (E, N, K) with per-expert column major format by simply mutating the tensor metadata. I had to do a physical memory layout transformation here, which is not ideal.

So I called that method cuda_2d in the script since it's using the 2d quantization kernel (name could be better, just trying to keep it concise).
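One way to see why a metadata-only view is not enough is to compare element offsets in the two layouts. This is an illustrative sketch with hypothetical helper names, assuming element strides and the shapes described above; it is not code from the PR.

```cpp
#include <array>
#include <cstdint>

using Strides = std::array<std::uint64_t, 3>;  // element strides for (e, n, k)

// Column-major (E*N, K) buffer reinterpreted as (E, N, K): element
// (e, n, k) sits at k*(E*N) + e*N + n.
inline std::uint64_t offset_flat_colmajor(std::uint64_t e, std::uint64_t n,
                                          std::uint64_t k, std::uint64_t E,
                                          std::uint64_t N) {
  return k * (E * N) + e * N + n;
}

// Per-expert column major: each expert's (N, K) slice is contiguous, so
// element (e, n, k) sits at e*N*K + k*N + n.
inline std::uint64_t offset_per_expert_colmajor(std::uint64_t e, std::uint64_t n,
                                                std::uint64_t k, std::uint64_t N,
                                                std::uint64_t K) {
  return e * N * K + k * N + n;
}

// The strides a pure view of the flat column-major buffer would have.
inline Strides strides_flat_colmajor_view(std::uint64_t E, std::uint64_t N) {
  return Strides{N, 1, E * N};
}

// The strides the per-expert column-major layout has.
inline Strides strides_per_expert_colmajor(std::uint64_t N, std::uint64_t K) {
  return Strides{N * K, 1, N};
}
```

In the flat layout the column stride is E*N, so consecutive columns of one expert are separated by every other expert's rows. In the per-expert layout the column stride is N. The two offsets agree only when E == 1, so for more than one expert the data has to be moved.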

ngimel · 2025-09-19T22:50:35Z

+                                         uint32_t shmem_k,
+                                         const size_t type_num_bits) {
+  // Get function pointer to cuTensorMapEncodeTiled
+  static void *driver_ptr = nullptr;


since both 2d and 3d map are using this, I think you should use a function that would return it (to not initialize it twice)

Hmm good point. Updated to do something like a singleton pattern, holding driver ptr as a global/static var that starts as null then we only initialize it once whenever the first kernel is called. Let me know if that is what you had in mind / will work. (It builds and tests pass)


ngimel · 2025-09-19T23:13:30Z

  }
 }

+static void *driver_ptr = nullptr;


I'd prefer something like

void *get_driver_ptr() {
  static void *driver_ptr = nullptr;
  if (!driver_ptr) {
    cudaDriverEntryPointQueryResult result;
    cudaGetDriverEntryPoint("cuTensorMapEncodeTiled", &driver_ptr, cudaEnableDefault, &result);
  }
  return driver_ptr;
}

that gets called from both create_3D_tensor_map and create_2D_tensor_map, but this would work too. If you are going with this you should put it in an anonymous namespace to not pollute other files that may include this, driver_ptr is pretty generic.

Makes sense - updated.


ngimel · 2025-09-19T23:38:50Z

+  static void *driver_ptr = nullptr;
+  if (!driver_ptr) {
+    cudaDriverEntryPointQueryResult result;
+    cudaGetDriverEntryPoint("cuTensorMapEncodeTiled", &driver_ptr,


also, you should error check this call

thanks, done
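For readers following the thread, the pattern being discussed (resolve once, share between callers, check the result) can be sketched without a CUDA toolchain. `fake_query_entry_point` below is a hypothetical stand-in for the real driver entry point query so the example runs anywhere; the final code in the PR may differ.

```cpp
#include <stdexcept>

namespace {

// Counts how often the stand-in lookup runs; for illustration only.
int lookup_calls = 0;

// Dummy object whose address plays the role of the resolved symbol.
int dummy_symbol = 0;

// Hypothetical stand-in for a driver entry point query: returns 0 on
// success and writes the resolved address to *out.
int fake_query_entry_point(const char *name, void **out) {
  ++lookup_calls;
  if (name == nullptr || out == nullptr) return 1;
  *out = &dummy_symbol;
  return 0;
}

// Resolve the pointer on first use and reuse it for every later caller,
// e.g. both a 2D and a 3D tensor map builder.
void *get_driver_ptr() {
  static void *driver_ptr = nullptr;
  if (!driver_ptr) {
    void *resolved = nullptr;
    const int status =
        fake_query_entry_point("cuTensorMapEncodeTiled", &resolved);
    // Error check: fail loudly instead of caching a null pointer.
    if (status != 0 || resolved == nullptr) {
      throw std::runtime_error("failed to resolve cuTensorMapEncodeTiled");
    }
    driver_ptr = resolved;
  }
  return driver_ptr;
}

}  // namespace
```

Keeping the static inside the function and the function inside an anonymous namespace avoids leaking a generic name like `driver_ptr` into other translation units. The explicit null check is not safe against two threads racing on the first call; initializing the static directly from a helper call would rely on the thread-safe static initialization that C++11 guarantees.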


danielvegamyhre added a commit that referenced this pull request Sep 19, 2025

[mxfp8 moe training] update 3d quant colwise scaling kernel to use si…

4a2210f

…ngle input/output TMA descriptors stack-info: PR: #3034, branch: danielvegamyhre/stack/74

danielvegamyhre force-pushed the danielvegamyhre/stack/73 branch from 4f9a778 to 0f949d8 Compare September 19, 2025 03:35

danielvegamyhre force-pushed the danielvegamyhre/stack/74 branch from 60d9553 to 4a2210f Compare September 19, 2025 03:35

This was referenced Sep 19, 2025

[mxfp8 moe training] add CUDA kernel to quantize 3d tensor colwise #3002

Merged

[mxfp8 moe training] wrap 3d quantize tensor in custom ops and integrate it #3004

Merged

[mxfp8 moe training] remove mxfp8_gemms.py #3033

Merged

meta-cla Bot added the CLA Signed label Sep 19, 2025

danielvegamyhre marked this pull request as draft September 19, 2025 03:51

danielvegamyhre changed the base branch from danielvegamyhre/stack/73 to main September 19, 2025 03:54

danielvegamyhre added a commit that referenced this pull request Sep 19, 2025

[mxfp8 moe training] update 3d quant colwise scaling kernel to use si…

928fe67

…ngle input/output TMA descriptors stack-info: PR: #3034, branch: danielvegamyhre/stack/74

danielvegamyhre force-pushed the danielvegamyhre/stack/74 branch from 4a2210f to 928fe67 Compare September 19, 2025 03:54

danielvegamyhre added a commit that referenced this pull request Sep 19, 2025

[mxfp8 moe training] update 3d quant colwise scaling kernel to use si…

ab67f71

…ngle input/output TMA descriptors stack-info: PR: #3034, branch: danielvegamyhre/stack/74

danielvegamyhre force-pushed the danielvegamyhre/stack/74 branch from 928fe67 to ab67f71 Compare September 19, 2025 03:55

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/73 September 19, 2025 03:55

danielvegamyhre changed the base branch from danielvegamyhre/stack/73 to main September 19, 2025 18:00

danielvegamyhre added a commit that referenced this pull request Sep 19, 2025

[mxfp8 moe training] update 3d quant colwise scaling kernel to use si…

db78695

…ngle input/output TMA descriptors stack-info: PR: #3034, branch: danielvegamyhre/stack/74

danielvegamyhre force-pushed the danielvegamyhre/stack/74 branch from ab67f71 to db78695 Compare September 19, 2025 18:00

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/73 September 19, 2025 18:00

danielvegamyhre changed the base branch from danielvegamyhre/stack/73 to main September 19, 2025 18:10

danielvegamyhre added a commit that referenced this pull request Sep 19, 2025

[mxfp8 moe training] update 3d quant colwise scaling kernel to use si…

caa7abc

…ngle input/output TMA descriptors stack-info: PR: #3034, branch: danielvegamyhre/stack/74

danielvegamyhre force-pushed the danielvegamyhre/stack/74 branch from db78695 to caa7abc Compare September 19, 2025 18:10

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/73 September 19, 2025 18:10

danielvegamyhre changed the base branch from danielvegamyhre/stack/73 to main September 19, 2025 18:29

danielvegamyhre added a commit that referenced this pull request Sep 19, 2025

[mxfp8 moe training] update 3d quant colwise scaling kernel to use si…

e9d937f

…ngle input/output TMA descriptors stack-info: PR: #3034, branch: danielvegamyhre/stack/74

danielvegamyhre force-pushed the danielvegamyhre/stack/74 branch from caa7abc to e9d937f Compare September 19, 2025 18:29

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/73 September 19, 2025 18:29

danielvegamyhre changed the base branch from danielvegamyhre/stack/73 to main September 19, 2025 18:36

danielvegamyhre added a commit that referenced this pull request Sep 19, 2025

[mxfp8 moe training] update 3d quant colwise scaling kernel to use si…

8f06b66

…ngle input/output TMA descriptors stack-info: PR: #3034, branch: danielvegamyhre/stack/74

danielvegamyhre force-pushed the danielvegamyhre/stack/74 branch from e9d937f to 8f06b66 Compare September 19, 2025 18:36

ngimel approved these changes Sep 19, 2025


drisspg approved these changes Sep 19, 2025


drisspg reviewed Sep 19, 2025


Comment thread torchao/csrc/cuda/mx_kernels/mxfp8_quantize.cuh Outdated

danielvegamyhre changed the base branch from danielvegamyhre/stack/73 to main September 19, 2025 20:42

danielvegamyhre added a commit that referenced this pull request Sep 19, 2025

[mxfp8 moe training] update 3d quant colwise scaling kernel to use si…

af56887

…ngle input/output TMA descriptors stack-info: PR: #3034, branch: danielvegamyhre/stack/74

danielvegamyhre force-pushed the danielvegamyhre/stack/74 branch from 333f316 to af56887 Compare September 19, 2025 20:42

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/73 September 19, 2025 20:42

danielvegamyhre force-pushed the danielvegamyhre/stack/73 branch from 0f949d8 to ddcd761 Compare September 19, 2025 20:54

danielvegamyhre added a commit that referenced this pull request Sep 19, 2025

[mxfp8 moe training] update 3d quant colwise scaling kernel to use si…

3ccbd21

…ngle input/output TMA descriptors stack-info: PR: #3034, branch: danielvegamyhre/stack/74

danielvegamyhre force-pushed the danielvegamyhre/stack/74 branch from af56887 to 3ccbd21 Compare September 19, 2025 20:54

danielvegamyhre force-pushed the danielvegamyhre/stack/73 branch from ddcd761 to 94bd695 Compare September 19, 2025 20:57

danielvegamyhre added a commit that referenced this pull request Sep 19, 2025

[mxfp8 moe training] update 3d quant colwise scaling kernel to use si…

908f676

…ngle input/output TMA descriptors stack-info: PR: #3034, branch: danielvegamyhre/stack/74

danielvegamyhre force-pushed the danielvegamyhre/stack/74 branch from 3ccbd21 to 908f676 Compare September 19, 2025 20:57

danielvegamyhre added a commit that referenced this pull request Sep 19, 2025

[mxfp8 moe training] update 3d quant colwise scaling kernel to use si…

7fc6c79

…ngle input/output TMA descriptors stack-info: PR: #3034, branch: danielvegamyhre/stack/74

danielvegamyhre force-pushed the danielvegamyhre/stack/74 branch from 908f676 to 7fc6c79 Compare September 19, 2025 20:59

danielvegamyhre changed the base branch from danielvegamyhre/stack/73 to main September 19, 2025 20:59

danielvegamyhre added a commit that referenced this pull request Sep 19, 2025

[mxfp8 moe training] update 3d quant colwise scaling kernel to use si…

96ec18a

…ngle input/output TMA descriptors stack-info: PR: #3034, branch: danielvegamyhre/stack/74

danielvegamyhre force-pushed the danielvegamyhre/stack/74 branch from 7fc6c79 to 96ec18a Compare September 19, 2025 22:10

ngimel reviewed Sep 19, 2025


danielvegamyhre added a commit that referenced this pull request Sep 19, 2025

[mxfp8 moe training] update 3d quant colwise scaling kernel to use si…

081feea

…ngle input/output TMA descriptors stack-info: PR: #3034, branch: danielvegamyhre/stack/74

danielvegamyhre force-pushed the danielvegamyhre/stack/74 branch from 96ec18a to 081feea Compare September 19, 2025 22:59

ngimel reviewed Sep 19, 2025


danielvegamyhre added a commit that referenced this pull request Sep 19, 2025

[mxfp8 moe training] update 3d quant colwise scaling kernel to use si…

248df72

…ngle input/output TMA descriptors stack-info: PR: #3034, branch: danielvegamyhre/stack/74

danielvegamyhre force-pushed the danielvegamyhre/stack/74 branch from 081feea to 248df72 Compare September 19, 2025 23:22

ngimel reviewed Sep 20, 2025


[mxfp8 moe training] update 3d quant colwise scaling kernel to use si…

28f2ad8

…ngle input/output TMA descriptors stack-info: PR: #3034, branch: danielvegamyhre/stack/74

danielvegamyhre force-pushed the danielvegamyhre/stack/74 branch from 248df72 to 28f2ad8 Compare September 20, 2025 02:26

danielvegamyhre merged commit d2fae7a into main Sep 20, 2025
14 of 17 checks passed

