integration of new mxfp8 casting cuda kernel by danielvegamyhre · Pull Request #2564 · pytorch/ao

danielvegamyhre · 2025-07-16T21:14:45Z

Stacked PRs:

->integration of new mxfp8 casting cuda kernel #2564

integration of new mxfp8 casting cuda kernel

Summary

This PR integrates the kernel added in #2513. The custom op wrapper was recently added in #2543. The remaining code to migrate from my private repo is in this PR:

- Register a sharding strategy for the mxfp8_quantize_cuda custom op.
- Add a wrapper with DTensor handling for the mxfp8_quantize_cuda custom op.
- Update triton_scale_swizzle kernel to accept both row major and col major inputs (since cuda kernel writes scale in col major, to avoid uncoalesced global accesses).
- Add the MXFP8Dim1CastKernelChoice enum and replace all uses of the boolean flag use_fp8_dim1_cast_triton_kernel with it (defaulting to Triton for now).
- Update tests accordingly and verify that they pass.
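As a rough illustration of the enum-for-boolean swap described above, the pattern might look like the sketch below. Only the names `MXFP8Dim1CastKernelChoice` and `use_fp8_dim1_cast_triton_kernel` come from this PR; the member names and values and the helper `choice_from_legacy_flag` are assumptions made for the example and are not the torchao definitions.

```python
from enum import Enum


class MXFP8Dim1CastKernelChoice(Enum):
    """Which implementation performs the mxfp8 cast along dim1 (sketch only)."""

    # Member names and values are assumed for illustration.
    TRITON = "triton"  # existing Triton kernel (the default for now, per this PR)
    CUDA = "cuda"      # new CUDA kernel from #2513
    TORCH = "torch"    # plain PyTorch ops, typically run under torch.compile


def choice_from_legacy_flag(
    use_fp8_dim1_cast_triton_kernel: bool,
) -> MXFP8Dim1CastKernelChoice:
    """Hypothetical helper: map the old boolean flag onto the enum."""
    if use_fp8_dim1_cast_triton_kernel:
        return MXFP8Dim1CastKernelChoice.TRITON
    return MXFP8Dim1CastKernelChoice.TORCH
```

A boolean can only distinguish two implementations, so adding a third option (the CUDA kernel) is what makes an enum the natural replacement.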

Test plan

pytest test/prototype/mx_formats/test_mx_linear.py -k eager_vs_hp
pytest test/prototype/mx_formats/test_mx_linear.py -k compile

Next steps

- Integrate into torchtitan for e2e FSDP training tests once this stack lands. Torchtitan PR: [mxpf8] Make mxfp8 dim1 cast kernel configurable torchtitan#1401
- DTensor tests still have issues with both Triton and CUDA: ./test/prototype/mx_formats/test_mx_dtensor.sh fails with "RuntimeError: Attempting to use FunctionalTensor on its own. Instead, please use it with a corresponding FunctionalTensorMode()." This is a known issue; I will follow up on it.

stack-info: PR: #2564, branch: danielvegamyhre/stack/13

pytorch-bot · 2025-07-16T21:14:49Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2564

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit cc00ef6 with merge base 95d13d5:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

danielvegamyhre · 2025-07-16T22:52:47Z

@vkuzo @drisspg I squashed the remaining changes in the original stack into the same PR, so that the tests would be in the same PR as the changes.

drisspg · 2025-07-16T22:59:30Z

Didn't read any code yet:

> Update triton_scale_swizzle kernel to accept both row major and col major inputs (since cuda kernel writes scale in col major, to avoid uncoalesced global accesses)

I think updating to just accept generically strided inputs and writing out row major (required for mm kernels) is good, is that what you did?

danielvegamyhre · 2025-07-16T23:09:15Z

> Didn't read any code yet:
> Update triton_scale_swizzle kernel to accept both row major and col major inputs (since cuda kernel writes scale in col major, to avoid uncoalesced global accesses)
> I think updating to just accept generically strided inputs and writing out row major (required for mm kernels) is good, is that what you did?

Yep, that's correct
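To make the "generically strided inputs, row-major output" idea concrete, here is a minimal pure-Python sketch. It is not the Triton kernel and it omits the swizzle step entirely; the function name `to_row_major` and the flat-list representation are assumptions chosen only to show how element strides let one routine read either layout.

```python
def to_row_major(buf, shape, strides):
    """Gather a 2D array stored in `buf` with arbitrary element strides
    into a new row-major (C-contiguous) flat list."""
    rows, cols = shape
    stride_r, stride_c = strides
    out = [None] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            # The source offset honours whatever strides the producer used.
            src = r * stride_r + c * stride_c
            # The destination is always row-major, i.e. strides (cols, 1).
            out[r * cols + c] = buf[src]
    return out


# The 2x3 array [[0, 1, 2], [3, 4, 5]] stored column-major has strides (1, rows).
col_major = [0, 3, 1, 4, 2, 5]
row_major = to_row_major(col_major, shape=(2, 3), strides=(1, 2))
# row_major is [0, 1, 2, 3, 4, 5]
```

Passing the same data in row-major form with strides `(3, 1)` yields the same result, which is the point of reading through strides instead of assuming one layout.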

vkuzo · 2025-07-17T11:30:01Z

 from tqdm import tqdm

 from torchao.prototype.mx_formats import MXLinearConfig
+from torchao.prototype.mx_formats.config import MXFP8CastKernelChoice


sorry, MXFP8Dim1CastKernelChoice? since for dim0 we are always using torch.compile

It was originally MXFP8Dim1CastKernelChoice, but here we discussed naming it MXFP8CastKernelChoice to leave room for adding dim0 and dim0+dim1 cast support as well. I don't have a strong opinion either way, so I changed it back to MXFP8Dim1CastKernelChoice.

vkuzo · 2025-07-17T11:31:07Z

    CUBLAS = "cublas"


+class MXFP8CastKernelChoice(Enum):


add some comments on what the options are?

vkuzo · 2025-07-17T11:32:37Z

+            a,
+            rowwise=False,
+            colwise=True,
+            scaling_mode="floor",


TODO for later to allow choice of scaling modes

Added a TODO on the custom op itself, with an explanation of why we are currently using a string param.
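For readers unfamiliar with the arguments in the snippet above, the following pure-Python sketch shows one common reading of a "floor" scaling mode combined with a column-wise cast. It follows the power-of-two scale rule from the OCP Microscaling (MX) specification (block size 32, E8M0 scales, float8 E4M3 elements) and is not taken from the CUDA kernel. The function names are hypothetical, and treating `colwise=True` as "each block is 32 consecutive elements within one column" is an assumption.

```python
import math

E4M3_EMAX = 8     # largest float8 E4M3 value is 448 = 1.75 * 2**8
E8M0_BIAS = 127   # exponent bias of the E8M0 scale format
BLOCK_SIZE = 32   # MX block size


def floor_scale_exponent(block):
    """Unbiased power-of-two scale exponent for one block, 'floor' style."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        # All-zero block: fall back to the smallest scale (a choice made for this sketch).
        return -E8M0_BIAS
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) == e - 1.
    return (math.frexp(amax)[1] - 1) - E4M3_EMAX


def to_e8m0(exponent):
    """Biased E8M0 encoding, clamped to the finite range [0, 254]."""
    return max(0, min(254, exponent + E8M0_BIAS))


def colwise_scale_exponents(matrix):
    """One scale exponent per (block of 32 rows, column) of a list-of-lists matrix."""
    rows, cols = len(matrix), len(matrix[0])
    assert rows % BLOCK_SIZE == 0, "row count must be a multiple of the block size"
    return [
        [
            floor_scale_exponent([matrix[r][c] for r in range(b, b + BLOCK_SIZE)])
            for c in range(cols)
        ]
        for b in range(0, rows, BLOCK_SIZE)
    ]


# A 32x2 matrix whose columns have absolute maxima 3.0 and 448.0.
example = [[3.0, 448.0] for _ in range(BLOCK_SIZE)]
exponents = colwise_scale_exponents(example)  # [[-7, 0]]
```

With an exponent of -7, the first column's values are divided by 2**-7, so 3.0 maps to 384, which fits below the E4M3 maximum of 448.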

Stacked PRs:
* __->__#1427

---

make mxfp8 dim1 cast kernel configurable

## Summary
- We recently added a new CUDA kernel for the mxfp8 dim1 cast which is ~1.4x faster than the existing Triton kernel or torch.compile, and using it results in an e2e training speedup of +1.5-2.5% TPS with Llama3 8b using FSDP=4/8 (pytorch/ao#2513). The integration work for composability with torch.compile + FSDP is complete as well: pytorch/ao#2564
- This PR updates the mxfp8 user-facing API, replacing the boolean flag `--mx.use_triton_for_dim1_cast=[true|false]` with `mxfp8_dim1_cast_kernel_choice=[triton|cuda|torch]`.

## Test plan
- Triton: `NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --training.steps=100 --model.converters="mx" --mx.recipe_name="mxfp8" --training.compile --mx.mxfp8_dim1_cast_kernel_choice="triton"`
- CUDA: `NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --training.steps=100 --model.converters="mx" --mx.recipe_name="mxfp8" --training.compile --mx.mxfp8_dim1_cast_kernel_choice="cuda"`
- Torch: `NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --training.steps=100 --model.converters="mx" --mx.recipe_name="mxfp8" --training.compile --mx.mxfp8_dim1_cast_kernel_choice="torch"`

## Limitations
- TP is not yet supported, as both the Triton kernel and the CUDA kernel are affected by an issue: `RuntimeError: Attempting to use FunctionalTensor on its own. Instead, please use it with a corresponding FunctionalTensorMode()`. This is a known issue we were talking to Brian about; we will continue following up on it.
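The move from a true/false flag to a three-way choice can be sketched with the standard library's `argparse`. This is only a stand-in to show the shape of the option: torchtitan has its own configuration system, and the `train` program name, the help text, and the `"triton"` default used here are assumptions for the example.

```python
import argparse

KERNEL_CHOICES = ("triton", "cuda", "torch")


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser exposing the kernel choice as a validated string option."""
    parser = argparse.ArgumentParser(prog="train")
    parser.add_argument(
        "--mx.mxfp8_dim1_cast_kernel_choice",
        dest="mxfp8_dim1_cast_kernel_choice",
        choices=KERNEL_CHOICES,  # any other value is rejected by argparse
        default="triton",        # assumed default for this sketch
        help="Implementation used for the mxfp8 dim1 cast.",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the example is self-contained.
args = build_parser().parse_args(["--mx.mxfp8_dim1_cast_kernel_choice=cuda"])
selected = args.mxfp8_dim1_cast_kernel_choice
```

Restricting the value with `choices` means a typo such as `cudda` fails at argument parsing time instead of silently falling back to another kernel.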

danielvegamyhre added a commit that referenced this pull request Jul 16, 2025

integration of new mxfp8 casting cuda kernel

3f88897

stack-info: PR: #2564, branch: danielvegamyhre/stack/13

danielvegamyhre force-pushed the danielvegamyhre/stack/13 branch from 651b912 to 3f88897 Compare July 16, 2025 21:14

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 16, 2025

danielvegamyhre added the module: not user facing Use this tag if you don't want this PR to show up in release notes label Jul 16, 2025

danielvegamyhre added a commit that referenced this pull request Jul 16, 2025

integration of new mxfp8 casting cuda kernel

20847a7

stack-info: PR: #2564, branch: danielvegamyhre/stack/13

danielvegamyhre force-pushed the danielvegamyhre/stack/13 branch from 3f88897 to 20847a7 Compare July 16, 2025 21:31

danielvegamyhre added a commit that referenced this pull request Jul 16, 2025

integration of new mxfp8 casting cuda kernel

b1ed196

stack-info: PR: #2564, branch: danielvegamyhre/stack/13

danielvegamyhre force-pushed the danielvegamyhre/stack/13 branch from 20847a7 to b1ed196 Compare July 16, 2025 21:39

danielvegamyhre added a commit that referenced this pull request Jul 16, 2025

integration of new mxfp8 casting cuda kernel

bb930b6

stack-info: PR: #2564, branch: danielvegamyhre/stack/13

danielvegamyhre force-pushed the danielvegamyhre/stack/13 branch from b1ed196 to bb930b6 Compare July 16, 2025 21:56

danielvegamyhre requested a review from vkuzo July 16, 2025 22:49

danielvegamyhre added a commit that referenced this pull request Jul 16, 2025

integration of new mxfp8 casting cuda kernel

b1e237f

stack-info: PR: #2564, branch: danielvegamyhre/stack/13

danielvegamyhre force-pushed the danielvegamyhre/stack/13 branch from bb930b6 to b1e237f Compare July 16, 2025 23:07

danielvegamyhre added a commit that referenced this pull request Jul 16, 2025

integration of new mxfp8 casting cuda kernel

c9c9da0

stack-info: PR: #2564, branch: danielvegamyhre/stack/13

danielvegamyhre force-pushed the danielvegamyhre/stack/13 branch from b1e237f to c9c9da0 Compare July 16, 2025 23:12

vkuzo reviewed Jul 17, 2025

vkuzo approved these changes Jul 17, 2025

danielvegamyhre force-pushed the danielvegamyhre/stack/13 branch from c9c9da0 to 7b1f899 Compare July 17, 2025 23:33

danielvegamyhre added a commit that referenced this pull request Jul 17, 2025

integration of new mxfp8 casting cuda kernel

7b1f899

stack-info: PR: #2564, branch: danielvegamyhre/stack/13

integration of new mxfp8 casting cuda kernel

cc00ef6

stack-info: PR: #2564, branch: danielvegamyhre/stack/13

danielvegamyhre force-pushed the danielvegamyhre/stack/13 branch from 7b1f899 to cc00ef6 Compare July 17, 2025 23:40

danielvegamyhre merged commit d828f91 into main Jul 18, 2025
19 checks passed

This was referenced Jul 18, 2025

[mxpf8] Make mxfp8 dim1 cast kernel configurable pytorch/torchtitan#1401

Closed

make mxfp8 dim1 cast kernel configurable pytorch/torchtitan#1427

Merged

liangel-02 pushed a commit that referenced this pull request Aug 25, 2025

integration of new mxfp8 casting cuda kernel (#2564)

e237b9a
