Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140972
Note: links to docs will display an error until the docs builds have completed.
❌ 1 New Failure, 3 Unrelated Failures as of commit 44bbed8 with merge base 09e5a93.
NEW FAILURE: the following job has failed.
FLAKY: the following jobs failed but were likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 981a699 to 808bc23 (compare).
engine,
mat1.data_ptr());
auto mat2_c = mat2.contiguous();
dnnl::memory weight = at::native::onednn::make_onednn_memory(
Are strided inputs not supported by oneDNN?

Yes; delete line 280 and use the tensor's strides in the oneDNN memory descriptor instead.
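As a side note on why describing the tensor's strides avoids the `contiguous()` copy, here is a minimal plain-Python sketch (no oneDNN; all names here are illustrative, not PyTorch or oneDNN APIs). A strided 2-D view maps the logical index (i, j) onto a flat buffer, so a transpose is just a stride swap with no data movement:

```python
# Sketch: a strided 2-D "view" over a flat buffer. Element (i, j) lives at
# flat offset i*stride0 + j*stride1, so transposing only swaps the strides.

def strided_get(buf, strides, i, j):
    """Read element (i, j) of a 2-D view over the flat buffer `buf`."""
    s0, s1 = strides
    return buf[i * s0 + j * s1]

# A 2x3 row-major matrix stored flat: [[0, 1, 2], [3, 4, 5]]
buf = [0, 1, 2, 3, 4, 5]
row_major = (3, 1)   # strides of the 2x3 view
transposed = (1, 3)  # strides of the 3x2 transposed view: no copy needed

assert strided_get(buf, row_major, 1, 2) == 5
# Transposed element (2, 1) is the same storage element as (1, 2) above.
assert strided_get(buf, transposed, 2, 1) == 5
```

A memory descriptor built from these strides lets the library read the transposed layout directly, which is the point of the review suggestion.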
if (with_bias) {
  args.insert({DNNL_ARG_BIAS, onednn_bias});
}
// auto sycl_queue = dnnl::sycl_interop::get_queue(stream);
Suggest removing this line.
at::native::resize_output(out, {mat1.size(0), mat2.size(1)});
onednn::scaled_matmul(
    out, mat1, mat2, bias, scale_a, scale_b, onednn::Attr());
return out;
torch._scaled_mm supports out_dtype, and oneDNN also supports different output data types (https://oneapi-src.github.io/oneDNN/dev_guide_matmul.html#data-types); suggest adding the related functionality.

The function only checks that the input out_dtype is the same as the dtype of the out tensor. It places no restrictions on out_dtype itself.
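A minimal sketch of the check described in the reply; `check_out_dtype` is a hypothetical helper name, not the actual function in the PR:

```python
# Sketch of the behaviour described above: the function only verifies that
# the requested out_dtype matches the dtype of the provided out tensor.
# It does not restrict which out_dtype values are allowed.

def check_out_dtype(out_dtype, out_tensor_dtype):
    if out_dtype is not None and out_dtype != out_tensor_dtype:
        raise TypeError(
            f"out_dtype {out_dtype!r} does not match out tensor dtype "
            f"{out_tensor_dtype!r}")

check_out_dtype("bfloat16", "bfloat16")  # matching dtypes pass
check_out_dtype(None, "float32")         # no out_dtype requested: also passes
try:
    check_out_dtype("float16", "bfloat16")
except TypeError:
    pass  # mismatch is rejected
```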
Force-pushed from 058967a to afe403d (compare).
What is the meaning of "current platform"? Is it related to the CPU, the GPU, the OS, or the oneDNN version? Is it about hardware or software?

This try/catch avoids configurations that oneDNN cannot handle, either because oneDNN has not implemented them yet or because the hardware does not support them.

Based on your implementation, my understanding is that removing this try-catch block might cause oneDNN to crash on certain GPUs (possibly integrated GPUs)? Is that correct?
Could oneDNN confirm which version will provide an API to query support for FP8 primitives?
test/xpu/test_scaled_mm.py (outdated)
nit: we should not import torchao here, as torchao has a dependency on PyTorch. All these functions in torchao are also outside the BC surface, so we really should not use them in other repos.

Removed torchao from the test.
@yuchengliu1, could you please help investigate the failures?

Yes, there is already a PR in torch-xpu-ops.

Most of these failures are due to the fallback to CPU; the CI actually runs the CPU scaled_mm. We need to find a way to remove the fallback first and then trigger a CI run to test the XPU scaled_mm.
guangyey left a comment:
Just a nit in the try-catch handling; otherwise, LGTM.
Adding @ZhiweiYan-96 in case you have some comments here.
if (scale_a.numel() == 1) {
  op_attr.set_scales_mask(DNNL_ARG_SRC, 0);
} else {
  // oneDNN 3.7 does not support per-token src scales, so use a post-op mul workaround
Currently, XPU uses oneDNN 3.8. Do you know which version of oneDNN will support this feature?

The current oneDNN release does not support this feature.
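The workaround mentioned in the code comment rests on a simple identity: scaling row i of A before the matmul gives the same result as scaling row i of A @ B afterwards, so a per-token (per-row) source scale can be applied as a post-op multiply on the output. A pure-Python sketch of the equivalence (no oneDNN, illustrative names only):

```python
# Per-token src scale vs. post-op multiply: diag(s) @ (A @ B) == (diag(s) @ A) @ B,
# so a row-wise scale on the source can be deferred to the output.

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def scale_rows(m, s):
    # Multiply row i of m by s[i] (one scale per token/row).
    return [[x * si for x in row] for row, si in zip(m, s)]

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
s = [0.5, 2.0]  # one scale per row of a

scaled_first = matmul(scale_rows(a, s), b)   # per-token src scale
scaled_after = scale_rows(matmul(a, b), s)   # post-op multiply workaround
assert scaled_first == scaled_after          # both give [[9.5, 11.0], [86.0, 100.0]]
```

This is why the kernel can fall back to a post-op multiply when `scale_a` is per-token rather than a scalar.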
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.

Rebase failed due to Command. Raised by https://github.com/pytorch/pytorch/actions/runs/16930825608
auto& engine = GpuEngineManager::Instance().get_engine();
auto& stream = GpuStreamManager::Instance().get_stream();

// Validation checks have passed lets resize the output to actual size
This comment actually is for
at::native::resize_output(out, {mat1.size(0), mat2.size(1)});
in Blas.cpp?

Sorry, I don't get your meaning. Could you explain it in more detail?

Oh, sorry for not commenting clearly. I mean, in the comment:
"validation checks have passed" should refer to the TORCH_CHECK(...) calls at the beginning of _scaled_mm_out_xpu;
"resize the output to actual size" should mean the resize_output(out, ...) at line 444 of Blas.cpp, because formerly this out is initialized as an empty 0-element tensor (line 459).
So, in my understanding, the comment should point at line 444 rather than at this oneDNN size creation. But it is not a big deal anyway.

The function is not only used in _scaled_mm_xpu. It can also be called like torch._scaled_mm(a, b, ..., out=out); out is an external variable here, and it may need to be resized.
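A small illustrative sketch of the out= resizing semantics described in the reply; `FakeTensor` and this `resize_output` are simplified stand-ins, not the ATen implementations:

```python
# Sketch: when the caller supplies an external `out` tensor, e.g.
# torch._scaled_mm(a, b, ..., out=out), the kernel must resize it to the
# actual result shape {mat1.size(0), mat2.size(1)} before writing.

class FakeTensor:
    def __init__(self, shape):
        self.shape = tuple(shape)

def resize_output(out, shape):
    """Resize `out` in place if its shape does not already match."""
    shape = tuple(shape)
    if out.shape != shape:
        out.shape = shape  # the real resize_output also reallocates storage
    return out

mat1, mat2 = FakeTensor((4, 8)), FakeTensor((8, 16))
out = FakeTensor((0,))  # e.g. initialized as an empty 0-element tensor
resize_output(out, (mat1.shape[0], mat2.shape[1]))
assert out.shape == (4, 16)
```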
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as
This PR implements `scaled_mm` for XPU. It enables the following data types:
1. TensorWise Scaling: `fp8_e4m3` and `fp8_e5m2`
2. RowWise Scaling: `fp8_e4m3` and `fp8_e5m2`

It leaves BlockWise Scaling to the next PR, so that this one requires less reviewing effort.

This is the first PR: it only adds `scaled_mm_xpu` but does not register it. We separate this out for less reviewing effort. Secondly, there is a `scaled_mm_v2` API in #164141. We will align with it once v1 is cleaned up.

**Co-authors:** @yuchengliu1, @carsonwang

## PR stack:
- -> #165978: implementation of XPU scaled_mm and oneDNN kernel
- #167518: implementation of XPU scaled_mm_v2
- #166056: op registration

## Test Status:
1. Relies on the changes in intel/torch-xpu-ops#1746; otherwise the op will fall back to CPU.
2. This PR does not include tests; the tests are enabled in #166056.

## Credit:
This work is based on @yuchengliu1's work at #140972. We created a new PR to align the API and checks with CUDA, so there will be less porting effort.

## FP8 Task tracker:
We will track all the scaled_mm related tasks in #167170.

Pull Request resolved: #165978
Approved by: https://github.com/liangan1, https://github.com/EikanWang
Co-authored-by: Eikan Wang <eikan.wang@intel.com>
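For reference, the two scaling recipes the PR enables can be sketched in plain Python (no FP8 arithmetic; the scales here are assumed to be dequantization scales applied multiplicatively, and all helper names are illustrative):

```python
# TensorWise: one scalar scale per tensor; the result is (A @ B) * sa * sb.
# RowWise: one scale per row of A and one per column of B; since
# (diag(sa) @ A) @ (B @ diag(sb)) == diag(sa) @ (A @ B) @ diag(sb),
# out[i][j] is scaled by sa[i] * sb[j].

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def tensorwise(a, b, sa, sb):
    return [[x * sa * sb for x in row] for row in matmul(a, b)]

def rowwise(a, b, sa, sb):
    # sa: one scale per row of a; sb: one scale per column of b
    out = matmul(a, b)
    return [[out[i][j] * sa[i] * sb[j] for j in range(len(out[0]))]
            for i in range(len(out))]

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]  # identity, so A @ B == A
assert tensorwise(a, b, 2.0, 0.5) == a
assert rowwise(a, b, [1.0, 2.0], [1.0, 1.0]) == [[1.0, 2.0], [6.0, 8.0]]
```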
scaled_mm has been supported in PyTorch since pytorch/pytorch#140972. This fallback would cause a duplicate registration; remove this fallback after pytorch/pytorch#140972 is merged.

---------

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Su Tong <tong.su@intel.com>
This PR introduces fp8 scaled_mm for Intel XPU. The UT `test\xpu\test_scaled_mm.py` is a subset of `test\test_matmul_cuda.py`. `torch-xpu-ops` currently falls scaled_mm back to CPU, so we will get a warning; this warning has no influence on the op. There will be a PR to remove the fallback in `torch-xpu-ops` after this PR is merged.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @gujinghui @PenghuiCheng @jianyuh @min-jean-cho @yanbing-j @Guobing-Chen @Xia-Weiwen @snadampal @voznesenskym @penguinwu @EikanWang @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @kwen2501 @c-p-i-o @yf225 @ColinPeppler @desertfire