[XPU] [Feature] [2/3] add fp8 scaled_mm_v2 implementation for XPU #167518
Stonepia wants to merge 4 commits into pytorch:main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167518

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures) As of commit b7e556b with merge base 4322354:

- FLAKY - The following job failed but was likely due to flakiness present on trunk.
- UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "module: xpu" "triaged" "topic: not user facing"
    const bool use_fast_accum,
    Tensor& out,
    const std::optional<Tensor>& alpha = std::nullopt) {
  // TODO: scale_result and alpha are not defined or used!
This follows the current CUDA implementation. We will refactor this code later.
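The TODO above notes that `alpha` is accepted but not yet used. Conceptually, once wired up, `alpha` would scale the accumulated product before it is written to `out`. A minimal numpy sketch of that semantic (the helper name is illustrative, not the XPU kernel):

```python
import numpy as np

def scaled_mm_sketch(a, b, scale_a, scale_b, alpha=None):
    # Dequantize both operands, multiply, then optionally apply alpha.
    # `alpha` here mirrors the currently-unused argument in the C++ signature.
    out = (a * scale_a) @ (b * scale_b)
    if alpha is not None:
        out = alpha * out
    return out

a = np.arange(6, dtype=np.float32).reshape(2, 3)
b = np.ones((3, 2), dtype=np.float32)
out = scaled_mm_sketch(a, b, 1.0, 1.0, alpha=2.0)
```

With identity scales and no `alpha`, this reduces to a plain matmul, which is why leaving `alpha` unused is currently harmless for callers that do not pass it.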
Please fix the CI failure.
This PR implements `scaled_mm` for XPU. It enables the following data types:

1. TensorWise Scaling: `fp8_e4m3` and `fp8_e5m2`
2. RowWise Scaling: `fp8_e4m3` and `fp8_e5m2`

It leaves BlockWise Scaling to the next PR to keep the review effort small. This is the first PR in the stack: it only adds `scaled_mm_xpu` and does not register it; registration is split out for the same reason. Secondly, there is a `scaled_mm_v2` API in #164141; we will align with it once v1 is cleaned up.

**Co-authors:** @yuchengliu1, @carsonwang

## PR stack:

- -> #165978 : implementation of XPU scaled_mm and oneDNN kernel
- #167518 : implementation of XPU scaled_mm_v2
- #166056 : Op registration

## Test Status:

1. Relies on the changes in intel/torch-xpu-ops#1746; otherwise the op will fall back to CPU.
2. This PR does not include tests; the tests are enabled in #166056.

## Credit:

This work is based on @yuchengliu1's work in #140972. We created a new PR to align the API and checks with CUDA, so there will be less porting effort.

## FP8 Task tracker:

We will track all the scaled_mm related tasks in #167170.

Pull Request resolved: #165978
Approved by: https://github.com/liangan1, https://github.com/EikanWang
Co-authored-by: Eikan Wang <eikan.wang@intel.com>
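TensorWise scaling, the first mode listed above, uses a single scale factor per tensor: each input is divided by its scale so its values fit the fp8 range, and both scales are folded back into the output after the matmul. A hedged numpy emulation of the semantics (the real fp8 cast and oneDNN kernel run on the device; 448.0 is the largest finite magnitude of `fp8_e4m3fn`, and the rounding step of the actual fp8 cast is omitted here):

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in fp8 e4m3fn

def quantize_tensorwise(x):
    """One scale for the whole tensor (TensorWise scaling)."""
    scale = np.abs(x).max() / E4M3_MAX
    # The device would also round q to fp8; we only rescale and clamp here.
    q = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)
    return q.astype(np.float32), np.float32(scale)

def scaled_mm_tensorwise(qa, scale_a, qb, scale_b):
    # Accumulate in fp32, then fold both tensor-wide scales back in.
    return (qa @ qb) * (scale_a * scale_b)

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)
b = rng.standard_normal((8, 4)).astype(np.float32)
qa, sa = quantize_tensorwise(a)
qb, sb = quantize_tensorwise(b)
out = scaled_mm_tensorwise(qa, sa, qb, sb)
```

Because the sketch skips the actual fp8 rounding, `out` matches `a @ b` up to fp32 roundoff; on device, the rounding of `q` to fp8 is where the precision loss occurs.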
Force-pushed from a632b00 to 714da47.
@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.

Successfully rebased.
Force-pushed from dc90b44 to 8be6ac3.
@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.

Successfully rebased.
Force-pushed from 8be6ac3 to b7e556b.
@pytorchbot label "ciflow/xpu"
To add these label(s) (ciflow/xpu) to the PR, please first approve the workflows that are awaiting approval (scroll to the bottom of this page). This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
@pytorchbot merge
Pull workflow has not been scheduled for the PR yet. It could be because the author doesn't have permissions to run those or skip-checks keywords were added to the PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before the next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.
The failed CI job is a timeout. This PR modifies neither the op nor the tests, so the failure should be unrelated.
@pytorchbot merge
Merge started

Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team.
…xpu (#166056)

This PR registers the `scaled_mm` op for XPU support. It does the following:

1. Registers the `_scaled_mm` and `_scaled_mm_v2` ops for XPU.
2. Enables XPU tests in `test_scaled_matmul_cuda.py`.
3. Updates the torch-xpu-ops pin to remove the `scaled_mm` fallback to the CPU implementation.

## PR Stack:

- #165978 : implementation of XPU scaled_mm and oneDNN kernel
- #167518 : implementation of XPU scaled_mm_v2
- -> #166056 : Op registration

## Task tracker:

We will track all the scaled_mm related tasks in #167170.

Pull Request resolved: #166056
Approved by: https://github.com/EikanWang, https://github.com/slayton58, https://github.com/drisspg
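RowWise scaling, the second mode this stack enables, keeps one scale per row of `A` and one per column of `B`; the output is then rescaled elementwise by the outer product of the two scale vectors. A minimal numpy sketch of the semantics (names are illustrative, not the registered op; the actual fp8 rounding on device is omitted):

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude of fp8 e4m3fn

def quantize_rowwise(x, axis):
    """Per-row (axis=1) or per-column (axis=0) scales, kept in fp32."""
    scale = np.abs(x).max(axis=axis, keepdims=True) / E4M3_MAX
    return (x / scale).astype(np.float32), scale.astype(np.float32)

def scaled_mm_rowwise(a, b):
    qa, sa = quantize_rowwise(a, axis=1)  # sa has shape (M, 1)
    qb, sb = quantize_rowwise(b, axis=0)  # sb has shape (1, N)
    # (M,1) @ (1,N) is the outer product of scales: it rescales
    # each output element out[i, j] by sa[i] * sb[j].
    return (qa @ qb) * (sa @ sb)

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)
b = rng.standard_normal((8, 4)).astype(np.float32)
out = scaled_mm_rowwise(a, b)
```

Per-row/per-column scales adapt to outliers in individual rows and columns, which is why RowWise scaling typically loses less precision than a single tensor-wide scale.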
This PR implements `scaled_mm_v2` for XPU, following the work in #164141.

PR stack:

- #165978 : implementation of XPU scaled_mm and oneDNN kernel
- -> #167518 : implementation of XPU scaled_mm_v2
- register `scaled_mm` and `scaled_mm_v2` for xpu #166056 : Op registration

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01 @gujinghui @EikanWang @fengyuan14 @guangyey