[xpu][feature] [1/3] add fp8 scaled_mm implementation for XPU#165978

Closed
Stonepia wants to merge 26 commits intopytorch:mainfrom
Stonepia:tong/xpu_scaled_mm

Conversation

@Stonepia
Contributor

@Stonepia Stonepia commented Oct 21, 2025

This PR implements scaled_mm for XPU. It enables the following data types:

  1. TensorWise Scaling: fp8_e4m3 and fp8_e5m2
  2. RowWise Scaling: fp8_e4m3 and fp8_e5m2
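
As a quick reference for the two fp8 formats named above: `fp8_e4m3` (the "fn" finite variant used by PyTorch's `float8_e4m3fn`) tops out at 448, while `fp8_e5m2` reaches 57344. The helper below is a hypothetical illustration, not code from this PR; it derives the largest finite value from the exponent/mantissa split.

```python
def fp8_max_finite(exp_bits: int, man_bits: int, *, fn: bool) -> float:
    """Largest finite value representable in an fp8 format.

    fn=True  -> "finite" variants like e4m3fn: the all-ones exponent field is
                still usable; only the all-ones mantissa pattern encodes NaN.
    fn=False -> IEEE-style formats like e5m2: the all-ones exponent field is
                reserved for inf/NaN.
    """
    bias = 2 ** (exp_bits - 1) - 1
    if fn:
        max_exp = (2 ** exp_bits - 1) - bias            # all-ones exponent usable
        max_man = (2 ** man_bits - 2) / 2 ** man_bits   # all-ones mantissa is NaN
    else:
        max_exp = (2 ** exp_bits - 2) - bias            # all-ones exponent reserved
        max_man = (2 ** man_bits - 1) / 2 ** man_bits
    return (1 + max_man) * 2.0 ** max_exp

print(fp8_max_finite(4, 3, fn=True))   # e4m3fn -> 448.0
print(fp8_max_finite(5, 2, fn=False))  # e5m2   -> 57344.0
```

The trade-off this makes visible: e4m3 spends bits on precision (3 mantissa bits, small range), e5m2 on range (5 exponent bits, coarser steps), which is why both are useful for scaled matmul.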

BlockWise Scaling is left to the next PR to keep the review effort smaller.

This first PR only adds scaled_mm_xpu without registering it; we split it out to reduce review effort.

Secondly, there is a scaled_mm_v2 API in #164141. We will align with it once v1 is cleaned up.
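
As a rough reference for what the two scaling modes compute, here is a hypothetical NumPy sketch (`scaled_mm_ref` is an illustrative name, not the oneDNN kernel in this PR): dequantize each operand with its scale, then matmul.

```python
import numpy as np

def scaled_mm_ref(a_q, b_q, scale_a, scale_b, out_dtype=np.float32):
    """Reference semantics for scaled matmul: dequantize, then matmul.

    TensorWise: scale_a and scale_b are scalars (one scale per tensor).
    RowWise:    scale_a has shape (M, 1), scale_b has shape (1, N).
    a_q: (M, K) quantized operand, b_q: (K, N) quantized operand.
    """
    a = a_q.astype(np.float32) * np.asarray(scale_a, dtype=np.float32)
    b = b_q.astype(np.float32) * np.asarray(scale_b, dtype=np.float32)
    return (a @ b).astype(out_dtype)

M, K, N = 2, 4, 3
a_q = np.arange(M * K, dtype=np.float32).reshape(M, K)
b_q = np.ones((K, N), dtype=np.float32)

# TensorWise: one scale per tensor.
out_tw = scaled_mm_ref(a_q, b_q, 0.5, 2.0)

# RowWise: one scale per row of A and one per column of B.
out_rw = scaled_mm_ref(a_q, b_q, np.full((M, 1), 0.5), np.full((1, N), 2.0))
```

The actual op takes fp8 tensors and fuses the scaling into the kernel; this sketch only pins down the broadcasting shapes the two modes imply.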

Co-author: @yuchengliu1, @carsonwang

PR stack:

  - -> #165978 : implementation of XPU scaled_mm and oneDNN kernel
  - #167518 : implementation of XPU scaled_mm_v2
  - #166056 : Op registration
Test Status:

  1. Relies on the changes in "remove scaled_mm fallback" intel/torch-xpu-ops#1746; otherwise the op will fall back to CPU.
  2. This PR does not include tests; the tests are enabled in [xpu][feature] [3/3] Register the scaled_mm and scaled_mm_v2 for xpu #166056.

Credit:

This work is based on @yuchengliu1's work in #140972. We created a new PR to align the API and checks with CUDA, so there will be less porting effort.

FP8 Task tracker:

We will track all the scaled_mm related tasks in: #167170

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01 @gujinghui @EikanWang @fengyuan14 @guangyey

@pytorch-bot

pytorch-bot Bot commented Oct 21, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/165978

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 949d894 with merge base b91a2ab:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot Bot added module: cpu CPU specific problem (e.g., perf, algorithm) release notes: inductor (aoti) labels Oct 21, 2025
@Stonepia Stonepia marked this pull request as draft October 21, 2025 08:38
@github-actions
Contributor

Attention! native_functions.yaml was changed

If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.


Caused by:

@github-actions
Contributor

Attention! One of PyTorch's C-stable API files was changed

You MUST NOT change existing function declarations in this file, as this header defines a stable C ABI. If you need to change the signature for a function, introduce a new v2 version of the function and modify code generation to target the new version of the function.


Caused by:

Comment thread aten/src/ATen/native/mkldnn/xpu/Blas.cpp Outdated
Comment thread aten/src/ATen/native/mkldnn/xpu/detail/QMatmul.cpp Outdated
Comment thread aten/src/ATen/native/mkldnn/xpu/detail/QMatmul.cpp Outdated
Comment thread aten/src/ATen/native/mkldnn/xpu/detail/QMatmul.cpp Outdated
@liangan1
Contributor

liangan1 commented Oct 22, 2025

@Stonepia This PR is still very large and the review effort is heavy. Since only tensor- and row-wise scaling are supported in this PR, I suggest removing the handling for unsupported scaling formats while keeping the design extensible so more scaling formats can be added later.

@Stonepia
Contributor Author

@pytorchbot label "module: xpu"

@pytorch-bot pytorch-bot Bot added the module: xpu Intel XPU related issues label Oct 22, 2025
Comment thread aten/src/ATen/native/mkldnn/xpu/Blas.cpp Outdated
@Stonepia Stonepia changed the title [XPU] [Draft] add fp8 scaled_mm for XPU [XPU] [1/2] add fp8 scaled_mm for XPU Oct 22, 2025
@Stonepia
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased tong/xpu_scaled_mm onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout tong/xpu_scaled_mm && git pull --rebase)

Comment thread aten/src/ATen/native/mkldnn/xpu/Blas.cpp Outdated
Comment thread aten/src/ATen/native/mkldnn/xpu/detail/Attr.h Outdated
@Stonepia Stonepia changed the title [XPU] [1/2] add fp8 scaled_mm for XPU [XPU] [1/2] add fp8 scaled_mm implementation for XPU Oct 22, 2025
Comment thread aten/src/ATen/native/mkldnn/xpu/detail/Attr.h Outdated
Comment thread aten/src/ATen/native/mkldnn/xpu/detail/QMatmul.cpp
@EikanWang
Collaborator

@pytorchbot merge

@pytorch-bot pytorch-bot Bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 12, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed: xpu / linux-noble-xpu-n-py3.10 / test (default, 6, 12, linux.idc.xpu)

Details for Dev Infra team Raised by workflow job

@etaf etaf changed the title [XPU] [Feature] [1/3] add fp8 scaled_mm implementation for XPU [xpu feature] [1/3] add fp8 scaled_mm implementation for XPU Nov 13, 2025
@Stonepia
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed: xpu / linux-noble-xpu-n-py3.10 / test (default, 6, 12, linux.idc.xpu)

Details for Dev Infra team Raised by workflow job

@Stonepia Stonepia changed the title [xpu feature] [1/3] add fp8 scaled_mm implementation for XPU [xpu][feature] [1/3] add fp8 scaled_mm implementation for XPU Nov 14, 2025
@Stonepia
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@Stonepia Stonepia deleted the tong/xpu_scaled_mm branch November 14, 2025 06:41
pytorchmergebot pushed a commit that referenced this pull request Nov 18, 2025
…67518)

This PR implements `scaled_mm_v2` for XPU follows the work in #164141 .
## PR stack:

- #165978 : implementation of XPU scaled_mm and oneDNN kernel
- -> #167518 : implementation of XPU scaled_mm_v2
- #166056 : Op registration

Pull Request resolved: #167518
Approved by: https://github.com/EikanWang, https://github.com/liangan1
Silv3S pushed a commit to Silv3S/pytorch that referenced this pull request Nov 18, 2025
…h#165978)


Pull Request resolved: pytorch#165978
Approved by: https://github.com/liangan1, https://github.com/EikanWang

Co-authored-by: Eikan Wang <eikan.wang@intel.com>
pytorchmergebot pushed a commit that referenced this pull request Dec 3, 2025
…xpu (#166056)

This PR registers the `scaled_mm` op for XPU support.

It does the following:
1. Registered the `_scaled_mm` and `_scaled_mm_v2` op for XPU.
2. Enables XPU tests in `test_scaled_matmul_cuda.py`.
3. Update torch-xpu-ops pin to remove fallback `scaled_mm` to CPU implementation.

## PR Stack:
- #165978 : implementation of XPU scaled_mm and oneDNN kernel
- #167518 : implementation of XPU scaled_mm_v2
- -> #166056 : Op registration

## Task tracker:
We will track all the scaled_mm related tasks in: #167170

Pull Request resolved: #166056
Approved by: https://github.com/EikanWang, https://github.com/slayton58, https://github.com/drisspg
JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
…xpu (#166056)

Labels

- ciflow/trunk (Trigger trunk jobs on your pull request)
- ciflow/xpu (Run XPU CI tasks)
- Merged
- module: cpu (CPU specific problem, e.g., perf, algorithm)
- module: xpu (Intel XPU related issues)
- open source
- release notes: inductor (aoti)
- triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

[Evaluated] FP8 support in matmul // Issues in test_matmul_cuda.py due to FP8

8 participants