[xpu][feature] [1/3] add fp8 scaled_mm implementation for XPU#165978

Closed
Stonepia wants to merge 26 commits intopytorch:mainfrom
Stonepia:tong/xpu_scaled_mm

Conversation

@Stonepia
Contributor

@Stonepia Stonepia commented Oct 21, 2025

This PR implements scaled_mm for XPU. It enables the following data types:

  1. TensorWise Scaling: fp8_e4m3 and fp8_e5m2
  2. RowWise Scaling: fp8_e4m3 and fp8_e5m2
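
As a quick reference for the two fp8 formats named above: `fp8_e4m3` (the "fn" finite variant used by PyTorch's `float8_e4m3fn`) tops out at 448, while `fp8_e5m2` reaches 57344. The helper below is a hypothetical illustration, not code from this PR; it derives the largest finite value from the exponent/mantissa split.

```python
def fp8_max_finite(exp_bits: int, man_bits: int, *, fn: bool) -> float:
    """Largest finite value representable in an fp8 format.

    fn=True  -> "finite" variants like e4m3fn: the all-ones exponent field is
                still usable; only the all-ones mantissa pattern encodes NaN.
    fn=False -> IEEE-style formats like e5m2: the all-ones exponent field is
                reserved for inf/NaN.
    """
    bias = 2 ** (exp_bits - 1) - 1
    if fn:
        max_exp = (2 ** exp_bits - 1) - bias            # all-ones exponent usable
        max_man = (2 ** man_bits - 2) / 2 ** man_bits   # all-ones mantissa is NaN
    else:
        max_exp = (2 ** exp_bits - 2) - bias            # all-ones exponent reserved
        max_man = (2 ** man_bits - 1) / 2 ** man_bits
    return (1 + max_man) * 2.0 ** max_exp

print(fp8_max_finite(4, 3, fn=True))   # e4m3fn -> 448.0
print(fp8_max_finite(5, 2, fn=False))  # e5m2   -> 57344.0
```

The trade-off this makes visible: e4m3 spends bits on precision (3 mantissa bits, small range), e5m2 on range (5 exponent bits, coarser steps), which is why both are useful for scaled matmul.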

BlockWise Scaling is left to the next PR to keep the review effort smaller.

This first PR only adds scaled_mm_xpu without registering it; we split it out to reduce review effort.

Secondly, there is a scaled_mm_v2 API in #164141. We will align with it once v1 is cleaned up.
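
As a rough reference for what the two scaling modes compute, here is a hypothetical NumPy sketch (`scaled_mm_ref` is an illustrative name, not the oneDNN kernel in this PR): dequantize each operand with its scale, then matmul.

```python
import numpy as np

def scaled_mm_ref(a_q, b_q, scale_a, scale_b, out_dtype=np.float32):
    """Reference semantics for scaled matmul: dequantize, then matmul.

    TensorWise: scale_a and scale_b are scalars (one scale per tensor).
    RowWise:    scale_a has shape (M, 1), scale_b has shape (1, N).
    a_q: (M, K) quantized operand, b_q: (K, N) quantized operand.
    """
    a = a_q.astype(np.float32) * np.asarray(scale_a, dtype=np.float32)
    b = b_q.astype(np.float32) * np.asarray(scale_b, dtype=np.float32)
    return (a @ b).astype(out_dtype)

M, K, N = 2, 4, 3
a_q = np.arange(M * K, dtype=np.float32).reshape(M, K)
b_q = np.ones((K, N), dtype=np.float32)

# TensorWise: one scale per tensor.
out_tw = scaled_mm_ref(a_q, b_q, 0.5, 2.0)

# RowWise: one scale per row of A and one per column of B.
out_rw = scaled_mm_ref(a_q, b_q, np.full((M, 1), 0.5), np.full((1, N), 2.0))
```

The actual op takes fp8 tensors and fuses the scaling into the kernel; this sketch only pins down the broadcasting shapes the two modes imply.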

Co-author: @yuchengliu1, @carsonwang

PR stack:

  - -> #165978 : implementation of XPU scaled_mm and oneDNN kernel
  - #167518 : implementation of XPU scaled_mm_v2
  - #166056 : Op registration
Test Status:

  1. Relies on the changes in "remove scaled_mm fallback" intel/torch-xpu-ops#1746; otherwise the op will fall back to CPU.
  2. This PR does not include tests; the tests are enabled in [xpu][feature] [3/3] Register the scaled_mm and scaled_mm_v2 for xpu #166056.

Credit:

This work is based on @yuchengliu1's work in #140972. We created a new PR to align the API and checks with CUDA, so there will be less porting effort.

FP8 Task tracker:

We will track all the scaled_mm related tasks in: #167170

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01 @gujinghui @EikanWang @fengyuan14 @guangyey

@pytorch-bot

pytorch-bot Bot commented Oct 21, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/165978

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 949d894 with merge base b91a2ab:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot Bot added module: cpu CPU specific problem (e.g., perf, algorithm) release notes: inductor (aoti) labels Oct 21, 2025
@Stonepia Stonepia marked this pull request as draft October 21, 2025 08:38
@github-actions
Contributor

Attention! native_functions.yaml was changed

If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.


Caused by:

@github-actions
Contributor

Attention! One of PyTorch's C-stable API files was changed

You MUST NOT change existing function declarations in this file, as this header defines a stable C ABI. If you need to change the signature for a function, introduce a new v2 version of the function and modify code generation to target the new version of the function.


Caused by:

Comment thread aten/src/ATen/native/mkldnn/xpu/Blas.cpp Outdated
Comment thread aten/src/ATen/native/mkldnn/xpu/detail/QMatmul.cpp Outdated
Comment thread aten/src/ATen/native/mkldnn/xpu/detail/QMatmul.cpp Outdated
Comment thread aten/src/ATen/native/mkldnn/xpu/detail/QMatmul.cpp Outdated
@liangan1
Contributor

liangan1 commented Oct 22, 2025

@Stonepia This PR is still very large and the review effort is heavy. Since only tensor- and row-wise scaling are supported in this PR, I suggest removing the handling for unsupported scaling formats while keeping the design extensible so more scaling formats can be added later.

@Stonepia
Contributor Author

@pytorchbot label "module: xpu"

@pytorch-bot pytorch-bot Bot added the module: xpu Intel XPU related issues label Oct 22, 2025
Comment thread aten/src/ATen/native/mkldnn/xpu/Blas.cpp Outdated
@Stonepia Stonepia changed the title [XPU] [Draft] add fp8 scaled_mm for XPU [XPU] [1/2] add fp8 scaled_mm for XPU Oct 22, 2025
@Stonepia
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased tong/xpu_scaled_mm onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout tong/xpu_scaled_mm && git pull --rebase)

Comment thread aten/src/ATen/native/mkldnn/xpu/Blas.cpp Outdated
Comment thread aten/src/ATen/native/mkldnn/xpu/detail/Attr.h Outdated
@Stonepia Stonepia changed the title [XPU] [1/2] add fp8 scaled_mm for XPU [XPU] [1/2] add fp8 scaled_mm implementation for XPU Oct 22, 2025
Comment thread aten/src/ATen/native/mkldnn/xpu/detail/Attr.h Outdated
Comment thread aten/src/ATen/native/mkldnn/xpu/detail/QMatmul.cpp
@EikanWang
Collaborator

@pytorchbot merge

@pytorch-bot pytorch-bot Bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 12, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed: xpu / linux-noble-xpu-n-py3.10 / test (default, 6, 12, linux.idc.xpu)

Details for Dev Infra team Raised by workflow job

@etaf etaf changed the title [XPU] [Feature] [1/3] add fp8 scaled_mm implementation for XPU [xpu feature] [1/3] add fp8 scaled_mm implementation for XPU Nov 13, 2025
@Stonepia
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed: xpu / linux-noble-xpu-n-py3.10 / test (default, 6, 12, linux.idc.xpu)

Details for Dev Infra team Raised by workflow job

@Stonepia Stonepia changed the title [xpu feature] [1/3] add fp8 scaled_mm implementation for XPU [xpu][feature] [1/3] add fp8 scaled_mm implementation for XPU Nov 14, 2025
@Stonepia
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@Stonepia Stonepia deleted the tong/xpu_scaled_mm branch November 14, 2025 06:41
pytorchmergebot pushed a commit that referenced this pull request Nov 18, 2025
…67518)

This PR implements `scaled_mm_v2` for XPU follows the work in #164141 .
## PR stack:

- #165978 : implementation of XPU scaled_mm and oneDNN kernel
- -> #167518 : implementation of XPU scaled_mm_v2
- #166056 : Op registration

Pull Request resolved: #167518
Approved by: https://github.com/EikanWang, https://github.com/liangan1
Silv3S pushed a commit to Silv3S/pytorch that referenced this pull request Nov 18, 2025
…h#165978)


Pull Request resolved: pytorch#165978
Approved by: https://github.com/liangan1, https://github.com/EikanWang

Co-authored-by: Eikan Wang <eikan.wang@intel.com>
pytorchmergebot pushed a commit that referenced this pull request Dec 3, 2025
…xpu (#166056)

This PR registers the `scaled_mm` op for XPU support.

It does the following:
1. Registered the `_scaled_mm` and `_scaled_mm_v2` op for XPU.
2. Enables XPU tests in `test_scaled_matmul_cuda.py`.
3. Update torch-xpu-ops pin to remove fallback `scaled_mm` to CPU implementation.

## PR Stack:
- #165978 : implementation of XPU scaled_mm and oneDNN kernel
- #167518 : implementation of XPU scaled_mm_v2
- -> #166056 : Op registration

## Task tracker:
We will track all the scaled_mm related tasks in: #167170

Pull Request resolved: #166056
Approved by: https://github.com/EikanWang, https://github.com/slayton58, https://github.com/drisspg
JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
…xpu (#166056)

Labels

- ciflow/trunk (Trigger trunk jobs on your pull request)
- ciflow/xpu (Run XPU CI tasks)
- Merged
- module: cpu (CPU specific problem, e.g., perf, algorithm)
- module: xpu (Intel XPU related issues)
- open source
- release notes: inductor (aoti)
- triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

[Evaluated] FP8 support in matmul // Issues in test_matmul_cuda.py due to FP8

8 participants