[aarch64] add sbgemm inner product op - cherrypick of PR #1768 (#1831)
vpirogov merged 3 commits into uxlfoundation:rls-v3.3
Conversation
With weights pre-packing enabled in torch.compile(), the weights arrive already reordered into the oneDNN format, so this change allows format_kind::blocked as one of the supported weight formats for the ACL inner product primitive.
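As a rough illustration of how a framework can obtain the primitive's preferred (possibly blocked) weight layout ahead of time, here is a minimal oneDNN C++ sketch. The shapes are made up for the example; it only shows the standard `format_tag::any` idiom, not this PR's internals.

```cpp
#include <dnnl.hpp>

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);

    // Made-up shapes: batch 128, 512 input channels, 1024 output channels.
    memory::desc src_md({128, 512}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc dst_md({128, 1024}, memory::data_type::f32, memory::format_tag::ab);
    // format_tag::any lets the implementation pick its preferred weight layout,
    // which may be a blocked (format_kind::blocked) format.
    memory::desc wei_md({1024, 512}, memory::data_type::f32, memory::format_tag::any);

    auto pd = inner_product_forward::primitive_desc(
            eng, prop_kind::forward_inference, src_md, wei_md, dst_md);

    // A framework such as PyTorch can reorder (pre-pack) the weights into
    // pd.weights_desc() once, then pass them in that layout at run time.
    return 0;
}
```

With pre-packing, the reorder cost is paid once at model load instead of on every forward call.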
@cfRod, @jondea please review. @snadampal, do you have tests passing locally?

I'm mainly interested in tests on Arm processors of course :)

Yes, @vpirogov, I tested

These tests do not cover mixed precision cases (

Hi @igorsafo, can you please point me to which tests you were referring to?

Hi @snadampal! UPDATE: It is a

Thanks, @igorsafo, I will look into adding those additional test cases to main.
As the title. Including issue fixes for aarch64:

- uxlfoundation/oneDNN#1831
- uxlfoundation/oneDNN#1834

---

## Validation results (on Intel CPU + Linux)

**Static quantization with Inductor on CV models**

Quant method | Geomean throughput ratio (v3.3.6/baseline)
-- | --
ptq | 0.982937
ptq (cpp wrapper) | 0.978384
qat | 0.978828

**Torchbench cpu userbenchmark with Inductor**

Items | Perf Geomean Ratio (v3.3.6/baseline)
-- | --
eager_throughtput_bf16_infer | 1.00x
eager_throughtput_fp32_infer | 1.00x
jit_llga_throughtput_amp_bf16 | 1.01x
jit_llga_throughtput_fp32 | 1.00x
eager_throughtput_fx_int8 | 1.00x
eager_throughtput_bf16_train | 1.46x
eager_throughtput_fp32_train | 1.41x

**Dynamo benchmarks tests**

Precision | Shape | Wrapper | Thread | Eager old/new GEOMEAN | Inductor old/new GEOMEAN
-- | -- | -- | -- | -- | --
Float32 | Static | Default | Multiple | 1.003836812 | 1.003425
Float32 | Static | Default | Single | 1.000181451 | 0.999611
Float32 | Dynamic | Default | Multiple | 1.003980183 | 1.006563
Float32 | Dynamic | Default | Single | 1.000076939 | 0.999969
AMP | Static | Default | Multiple | 0.996824772 | 0.998715
AMP | Static | Default | Single | 0.996402574 | 1.001483
AMP | Dynamic | Default | Multiple | 0.994919866 | 1.000467
AMP | Dynamic | Default | Single | 0.9962054 | 1.000767

(on Aarch64) pytorch#122164 (comment)

---

Pull Request resolved: pytorch#122164

Approved by: https://github.com/snadampal, https://github.com/malfet, https://github.com/atalman
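For readers unfamiliar with the "geomean ratio" columns above: the geometric mean is used for throughput ratios because it treats a 2x speedup and a 2x slowdown symmetrically. A small sketch (the per-model ratios here are hypothetical; the tables above report only the aggregates):

```python
import math

def geomean(ratios):
    """Geometric mean of per-model throughput ratios (e.g. v3.3.6 / baseline)."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical per-model ratios; a value near 1.0 means performance parity.
print(geomean([0.98, 1.00, 1.02]))
```

A geomean of 1.00x, as in most rows above, indicates no measurable regression from the upgrade.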
Description
Added the sbgemm inner product op and blocked-weights support to enable PyTorch torch.compile() and bf16 fastmath kernels to work together on aarch64.
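For context, bf16 fastmath in oneDNN is opted into through the fpmath mode, either globally via the `ONEDNN_DEFAULT_FPMATH_MODE=BF16` environment variable or per primitive via attributes. A minimal sketch of the per-primitive route (shapes are illustrative):

```cpp
#include <dnnl.hpp>

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);

    memory::desc src_md({64, 256}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc dst_md({64, 128}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc wei_md({128, 256}, memory::data_type::f32, memory::format_tag::any);

    // Allow implicit down-conversion of f32 data to bf16 inside the compute,
    // enabling bf16 fastmath kernels (e.g. sbgemm on aarch64) where available.
    primitive_attr attr;
    attr.set_fpmath_mode(fpmath_mode::bf16);

    auto pd = inner_product_forward::primitive_desc(
            eng, prop_kind::forward_inference, src_md, wei_md, dst_md, attr);
    return 0;
}
```

The model still sees f32 inputs and outputs; only the internal accumulation path uses bf16 arithmetic.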
Checklist
General
Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit? Tested `make test` and the inner product primitive tests (`make test_bench_ip_ci`).
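For reference, the checks above can be run from a configured oneDNN build directory roughly as follows (target names taken from the checklist; exact benchdnn target names can vary between oneDNN versions):

```shell
# From an already-configured CMake build directory of oneDNN.
make -j test              # full unit test suite
make -j test_bench_ip_ci  # inner product benchdnn CI tests (name as used above)
```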