
[XPU] [Feature] [2/3] add fp8 scaled_mm_v2 implementation for XPU#167518

Closed
Stonepia wants to merge 4 commits into pytorch:main from Stonepia:tong/scaled_mm_v2

Conversation

@Stonepia
Contributor

@Stonepia Stonepia commented Nov 11, 2025

@pytorch-bot

pytorch-bot bot commented Nov 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167518

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit b7e556b with merge base 4322354:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Nov 11, 2025
@Stonepia
Contributor Author

@pytorchbot label "module: xpu" "triaged" "topic: not user facing"

@pytorch-bot pytorch-bot bot added module: xpu Intel XPU related issues topic: not user facing topic category triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Nov 11, 2025
@Stonepia Stonepia marked this pull request as ready for review November 11, 2025 06:45
@Stonepia Stonepia marked this pull request as draft November 11, 2025 06:56
const bool use_fast_accum,
Tensor& out,
const std::optional<Tensor>& alpha = std::nullopt) {
// TODO: scale_result and alpha are not used yet!
Contributor Author

This follows the current CUDA implementation. We will refactor this code later.

@Stonepia Stonepia marked this pull request as ready for review November 11, 2025 07:03
@EikanWang
Collaborator

Please fix the CI failure.

pytorchmergebot pushed a commit that referenced this pull request Nov 14, 2025
This PR implements `scaled_mm` for XPU. It enables the following data types:
1. TensorWise Scaling: `fp8_e4m3` and `fp8_e5m2`
2. RowWise Scaling:  `fp8_e4m3` and `fp8_e5m2`
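As a rough illustration of the two scaling granularities above, here is a plain-Python reference-semantics sketch. All helper names are ours and purely illustrative; this mirrors the math (dequantize-then-matmul), not the oneDNN kernel or the actual `_scaled_mm` signature:

```python
# Reference semantics: out = (A * scale_a) @ (B * scale_b), where the scales
# undo the fp8 quantization applied to A and B. Plain lists stand in for tensors.

def matmul(a, b):
    # Naive float matmul over lists of lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def scaled_mm_tensorwise(a, b, scale_a, scale_b):
    # TensorWise: one scalar scale per operand.
    a_s = [[x * scale_a for x in row] for row in a]
    b_s = [[x * scale_b for x in row] for row in b]
    return matmul(a_s, b_s)

def scaled_mm_rowwise(a, b, scales_a, scales_b):
    # RowWise: one scale per row of A, one scale per column of B.
    a_s = [[x * s for x in row] for row, s in zip(a, scales_a)]
    b_s = [[x * s for x, s in zip(row, scales_b)] for row in b]
    return matmul(a_s, b_s)

a = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
b = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
print(scaled_mm_tensorwise(a, b, 0.5, 2.0))  # scales cancel: same as A @ B
print(scaled_mm_rowwise(a, b, [0.5, 1.0], [2.0, 1.0]))
```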

It leaves BlockWise Scaling to the next PR to keep the reviewing effort small.

This is the first PR: it only adds `scaled_mm_xpu` but does not register it. We separate this out to keep the reviewing effort small.

Secondly, there is a `scaled_mm_v2` API in #164141. We will align with it once v1 is cleaned up.

**Co-authors:** @yuchengliu1, @carsonwang

## PR stack:

- -> #165978 : implementation of XPU scaled_mm and oneDNN kernel
- #167518 : implementation of XPU scaled_mm_v2
- #166056 : Op registration

## Test Status:

1. Relies on the changes in intel/torch-xpu-ops#1746; otherwise the op falls back to CPU.
2. This PR does not include tests; the tests are enabled in #166056.

## Credit:

This work is based on @yuchengliu1's work in #140972. We created a new PR to align the API and checks with CUDA, so there will be less porting effort.

## FP8 Task tracker:
We will track all the scaled_mm related tasks in: #167170

Pull Request resolved: #165978
Approved by: https://github.com/liangan1, https://github.com/EikanWang

Co-authored-by: Eikan Wang <eikan.wang@intel.com>
@etaf etaf added the ciflow/xpu Run XPU CI tasks label Nov 14, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Nov 14, 2025
@etaf etaf added the ciflow/xpu Run XPU CI tasks label Nov 14, 2025
@Stonepia
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased tong/scaled_mm_v2 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout tong/scaled_mm_v2 && git pull --rebase)

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Nov 16, 2025
@etaf etaf added the ciflow/xpu Run XPU CI tasks label Nov 16, 2025
@Stonepia
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased tong/scaled_mm_v2 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout tong/scaled_mm_v2 && git pull --rebase)

@pytorch-bot pytorch-bot bot removed ciflow/trunk Trigger trunk jobs on your pull request ciflow/xpu Run XPU CI tasks labels Nov 17, 2025
@Stonepia
Contributor Author

@pytorchbot label "ciflow/xpu"

@pytorch-bot

pytorch-bot bot commented Nov 17, 2025

To add these label(s) (ciflow/xpu) to the PR, please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@Stonepia
Contributor Author

@pytorchbot merge

@pytorch-bot

pytorch-bot bot commented Nov 17, 2025

The pull workflow has not been scheduled for this PR yet. This could be because the author doesn't have permission to run it, or because skip-checks keywords were added to the PR/commits; aborting merge. Please get/give approval for the workflows and/or remove the skip-ci decorators before the next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@chuanqi129 chuanqi129 added keep-going Don't stop on first failure, keep running tests until the end ciflow/xpu Run XPU CI tasks labels Nov 17, 2025
@Stonepia
Contributor Author

The failed CI job is due to a timeout. This PR modifies neither the op nor the tests, so the failure should be unrelated.

@Stonepia
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 18, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@Stonepia Stonepia deleted the tong/scaled_mm_v2 branch November 18, 2025 03:27
Silv3S pushed a commit to Silv3S/pytorch that referenced this pull request Nov 18, 2025
…h#165978)

pytorchmergebot pushed a commit that referenced this pull request Dec 3, 2025
…xpu (#166056)

This PR registers the `scaled_mm` op for XPU support.

It does the following:
1. Registers the `_scaled_mm` and `_scaled_mm_v2` ops for XPU.
2. Enables XPU tests in `test_scaled_matmul_cuda.py`.
3. Updates the torch-xpu-ops pin to remove the `scaled_mm` fallback to the CPU implementation.

## PR Stack:
- #165978 : implementation of XPU scaled_mm and oneDNN kernel
- #167518 : implementation of XPU scaled_mm_v2
- -> #166056 : Op registration

## Task tracker:
We will track all the scaled_mm related tasks in: #167170

Pull Request resolved: #166056
Approved by: https://github.com/EikanWang, https://github.com/slayton58, https://github.com/drisspg
JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
…xpu (#166056)
