[XPU] [Feature] [2/3] add fp8 scaled_mm_v2 implementation for XPU #167518
Stonepia wants to merge 4 commits into pytorch:main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167518

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures) As of commit b7e556b with merge base 4322354:

- FLAKY - The following job failed but was likely due to flakiness present on trunk.
- UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "module: xpu" "triaged" "topic: not user facing"
    const bool use_fast_accum,
    Tensor& out,
    const std::optional<Tensor>& alpha = std::nullopt) {
  // TODO: scale_result and alpha are not defined or used!
This follows the current CUDA implementation. We will refactor this code later.
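The TODO above notes that `alpha` is accepted but not yet used. Conceptually, once wired up, `alpha` would scale the accumulated product before it is written to `out`. A minimal numpy sketch of that semantic (the helper name is illustrative, not the XPU kernel):

```python
import numpy as np

def scaled_mm_sketch(a, b, scale_a, scale_b, alpha=None):
    # Dequantize both operands, multiply, then optionally apply alpha.
    # `alpha` here mirrors the currently-unused argument in the C++ signature.
    out = (a * scale_a) @ (b * scale_b)
    if alpha is not None:
        out = alpha * out
    return out

a = np.arange(6, dtype=np.float32).reshape(2, 3)
b = np.ones((3, 2), dtype=np.float32)
out = scaled_mm_sketch(a, b, 1.0, 1.0, alpha=2.0)
```

With identity scales and no `alpha`, this reduces to a plain matmul, which is why leaving `alpha` unused is currently harmless for callers that do not pass it.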
Please fix the CI failure.
This PR implements `scaled_mm` for XPU. It enables the following data types:

1. TensorWise Scaling: `fp8_e4m3` and `fp8_e5m2`
2. RowWise Scaling: `fp8_e4m3` and `fp8_e5m2`

It leaves BlockWise Scaling to the next PR to keep the review effort small. This is the first PR in the stack: it only adds `scaled_mm_xpu` and does not register it; registration is split out for the same reason. Secondly, there is a `scaled_mm_v2` API in #164141; we will align with it once v1 is cleaned up.

**Co-authors:** @yuchengliu1, @carsonwang

## PR stack:

- -> #165978 : implementation of XPU scaled_mm and oneDNN kernel
- #167518 : implementation of XPU scaled_mm_v2
- #166056 : Op registration

## Test Status:

1. Relies on the changes in intel/torch-xpu-ops#1746; otherwise the op will fall back to CPU.
2. This PR does not include tests; the tests are enabled in #166056.

## Credit:

This work is based on @yuchengliu1's work in #140972. We created a new PR to align the API and checks with CUDA, so there will be less porting effort.

## FP8 Task tracker:

We will track all the scaled_mm related tasks in #167170.

Pull Request resolved: #165978
Approved by: https://github.com/liangan1, https://github.com/EikanWang
Co-authored-by: Eikan Wang <eikan.wang@intel.com>
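TensorWise scaling, the first mode listed above, uses a single scale factor per tensor: each input is divided by its scale so its values fit the fp8 range, and both scales are folded back into the output after the matmul. A hedged numpy emulation of the semantics (the real fp8 cast and oneDNN kernel run on the device; 448.0 is the largest finite magnitude of `fp8_e4m3fn`, and the rounding step of the actual fp8 cast is omitted here):

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in fp8 e4m3fn

def quantize_tensorwise(x):
    """One scale for the whole tensor (TensorWise scaling)."""
    scale = np.abs(x).max() / E4M3_MAX
    # The device would also round q to fp8; we only rescale and clamp here.
    q = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)
    return q.astype(np.float32), np.float32(scale)

def scaled_mm_tensorwise(qa, scale_a, qb, scale_b):
    # Accumulate in fp32, then fold both tensor-wide scales back in.
    return (qa @ qb) * (scale_a * scale_b)

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)
b = rng.standard_normal((8, 4)).astype(np.float32)
qa, sa = quantize_tensorwise(a)
qb, sb = quantize_tensorwise(b)
out = scaled_mm_tensorwise(qa, sa, qb, sb)
```

Because the sketch skips the actual fp8 rounding, `out` matches `a @ b` up to fp32 roundoff; on device, the rounding of `q` to fp8 is where the precision loss occurs.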
Force-pushed from a632b00 to 714da47.
@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.

Successfully rebased.
Force-pushed from dc90b44 to 8be6ac3.
@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.

Successfully rebased.
Force-pushed from 8be6ac3 to b7e556b.
@pytorchbot label "ciflow/xpu"
To add these label(s) (ciflow/xpu) to the PR, please first approve the workflows that are awaiting approval (scroll to the bottom of this page). This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
@pytorchbot merge
Pull workflow has not been scheduled for the PR yet. It could be because the author doesn't have permissions to run those or skip-checks keywords were added to the PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before the next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.
The failed CI job is a timeout. This PR modifies neither the op nor the tests, so the failure should be unrelated.
@pytorchbot merge
Merge started

Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team.
…xpu (#166056)

This PR registers the `scaled_mm` op for XPU support. It does the following:

1. Registers the `_scaled_mm` and `_scaled_mm_v2` ops for XPU.
2. Enables XPU tests in `test_scaled_matmul_cuda.py`.
3. Updates the torch-xpu-ops pin to remove the `scaled_mm` fallback to the CPU implementation.

## PR Stack:

- #165978 : implementation of XPU scaled_mm and oneDNN kernel
- #167518 : implementation of XPU scaled_mm_v2
- -> #166056 : Op registration

## Task tracker:

We will track all the scaled_mm related tasks in #167170.

Pull Request resolved: #166056
Approved by: https://github.com/EikanWang, https://github.com/slayton58, https://github.com/drisspg
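RowWise scaling, the second mode this stack enables, keeps one scale per row of `A` and one per column of `B`; the output is then rescaled elementwise by the outer product of the two scale vectors. A minimal numpy sketch of the semantics (names are illustrative, not the registered op; the actual fp8 rounding on device is omitted):

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude of fp8 e4m3fn

def quantize_rowwise(x, axis):
    """Per-row (axis=1) or per-column (axis=0) scales, kept in fp32."""
    scale = np.abs(x).max(axis=axis, keepdims=True) / E4M3_MAX
    return (x / scale).astype(np.float32), scale.astype(np.float32)

def scaled_mm_rowwise(a, b):
    qa, sa = quantize_rowwise(a, axis=1)  # sa has shape (M, 1)
    qb, sb = quantize_rowwise(b, axis=0)  # sb has shape (1, N)
    # (M,1) @ (1,N) is the outer product of scales: it rescales
    # each output element out[i, j] by sa[i] * sb[j].
    return (qa @ qb) * (sa @ sb)

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)
b = rng.standard_normal((8, 4)).astype(np.float32)
out = scaled_mm_rowwise(a, b)
```

Per-row/per-column scales adapt to outliers in individual rows and columns, which is why RowWise scaling typically loses less precision than a single tensor-wide scale.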
This PR implements `scaled_mm_v2` for XPU, following the work in #164141.

PR stack:

- #165978 : implementation of XPU scaled_mm and oneDNN kernel
- -> #167518 : implementation of XPU scaled_mm_v2
- register `scaled_mm` and `scaled_mm_v2` for xpu #166056 : Op registration

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01 @gujinghui @EikanWang @fengyuan14 @guangyey