[WOQ] Add XPU kernel for _weight_int8pack_mm#160938
xiaowangintel wants to merge 2 commits into pytorch:main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160938
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 unrelated failures) As of commit bcdb3c0 with merge base 90f50f7.
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Attention! native_functions.yaml was changed. If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs: one that adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info. Caused by:
Attention! One of PyTorch's C-stable API files was changed. You MUST NOT change existing function declarations in it, as this header defines a stable C ABI. If you need to change the signature of a function, introduce a new v2 version of the function and modify code generation to target the new version. Caused by:
@liangan1 @ZhiweiYan-96 @guangyey @EikanWang Please help review this PR.
TORCH_CHECK(
    A.dtype() == kBFloat16 || A.dtype() == kHalf || A.dtype() == kFloat,
    __func__,
    " : expect A to be either 32-bit or 16-bit float tensor.");
TORCH_CHECK(A.dim() == 2, __func__, " : expect A to be 2D tensor.");
TORCH_CHECK(
    A.stride(1) == 1,
    __func__,
    " : A must be contiguous on the last dimension.");
TORCH_CHECK(B.dtype() == kChar, __func__, " : expect B to be int8 tensor.");
TORCH_CHECK(B.is_contiguous(), __func__, " : expect B to be contiguous.");
TORCH_CHECK(B.size(1) == K, __func__, " : expect B.size(1) == ", K);
__func__ is already included by TORCH_CHECK, so it could be removed here.
A.contiguous(),
1.0,
0,
B.contiguous(),
Suggested change:
- B.contiguous(),
+ B
// --- Launch kernel ---
Tensor bias = at::Tensor();
Tensor mat2_zero_points = at::Tensor();
Tensor non_const_scales = scales;
Since there are no further operations on non_const_scales, why not use scales directly?
quantized_matmul receives the weight scales as a non-const lvalue reference. However, scales is a const Tensor&, which causes C++ compilation errors.
Generally LGTM.
" : expect scales to be 1d tensor with size ",
N);

auto C = at::empty({M, N}, A.options());
Is it possible to invoke native::empty directly?
@jerryzh168 can you help to review this PR?
@xiaowangintel Please rebase the code and fix the CI issue.
Force-pushed from 9df7c03 to 74737aa.
Force-pushed from 74737aa to faca71e.
@drisspg can you help to review this PR?
jerryzh168
left a comment
Can't provide meaningful reviews as I'm not familiar with the hardware details, but I can stamp. Also, should this op live in torchao in the end?
Thanks Jerry. Yes, this op will be used to speed up WOQ-INT8 in torchao.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Summary: This PR implements an XPU kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that was previously only supported on CPU and CUDA. Motivation: Same as pytorch#159325. Pull Request resolved: pytorch#160938 Approved by: https://github.com/EikanWang, https://github.com/ZhiweiYan-96, https://github.com/liangan1, https://github.com/jerryzh168
Summary: Supports the woq_int8 Inductor pattern on Intel GPU. When using torch.compile, woq_int8 is lowered to _weight_int8pack_mm instead of falling back to mul().sum(). The Intel GPU backend of _weight_int8pack_mm was added in #160938. Pull Request resolved: #163615 Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/desertfire, https://github.com/jansel
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168