
[aarch64] add sbgemm inner product op - cherrypick of pr#1768#1831

Merged
vpirogov merged 3 commits into uxlfoundation:rls-v3.3 from snadampal:cherrypick_pr1768
Mar 13, 2024
Conversation

@snadampal
Contributor

Description

Added sbgemm inner product op and blocked weights support to enable PyTorch torch.compile() and bf16 fastmath kernels to work together on aarch64.


Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
    Tested make test and the inner product primitive tests (make test_bench_ip_ci).
  • [x] Have you formatted the code using clang-format?

Performance improvements

  • Have you submitted performance data that demonstrates performance improvements?

New features

  • Have you published an RFC for the new feature?
  • Was the RFC approved?
  • Have you added relevant tests?

Bug fixes

  • Have you included information on how to reproduce the issue (either in a github issue or in this PR)?
  • Have you added relevant regression tests?

RFC PR

  • Does RFC document follow the template?
  • Have you added a link to the rendered document?

With weights pre-packing enabled in torch.compile(),
the weights arrive already reordered into oneDNN format,
so allow format_kind::blocked as one of the supported
formats for the ACL inner product primitive.
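For context, the bf16 fastmath path described here computes an inner product with f32 activations against weights rounded to bf16, while accumulating in f32. A minimal pure-Python sketch of those numerics (function names are illustrative, not oneDNN API):

```python
import struct

def to_bf16(x: float) -> float:
    """Round a float32 value to bfloat16 precision (round-to-nearest-even),
    returned as a float for convenience."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # round the low 16 mantissa bits to nearest even, then truncate them
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def ip_f32_bf16(activations, weights):
    """Inner product: f32 activations x bf16-rounded weights, f32 accumulation.
    `weights` is a list of rows, one per output channel."""
    return [
        sum(a * to_bf16(w) for a, w in zip(activations, row))
        for row in weights
    ]
```

Values exactly representable in bf16 (powers of two and short mantissas) pass through unchanged, while others lose low mantissa bits, which is why mixed-precision coverage in the test suite matters.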
@vpirogov
Contributor

+@cfRod, @jondea please review

@snadampal, do you have tests passing locally?

@vpirogov
Contributor

I'm mainly interested in tests on Arm processors of course :)

@snadampal
Contributor Author

Yes, @vpirogov, I tested:
make test
make test_bench_ip_ci and
./benchdnn --ip --mode=P --engine=cpu --allow-enum-tags-only=0 --batch=inputs/ip/test_ip_acl

@igorsafo
Contributor

igorsafo commented Mar 13, 2024

Yes, @vpirogov , I tested make test make test_bench_ip_ci and ./benchdnn --ip --mode=P --engine=cpu --allow-enum-tags-only=0 --batch=inputs/ip/test_ip_acl

These tests do not cover mixed-precision cases (f32:bf16). Could you please add them as well? Please make sure they do not fail on platforms where mixed precision is not supported by ACL.

@snadampal
Contributor Author

Hi @igorsafo , can you please point me to which tests you were referring to?

@igorsafo
Contributor

igorsafo commented Mar 13, 2024

Hi @igorsafo , can you please point me to which tests you were referring to?

Hi @snadampal !
Currently neither test_ip_ci nor test_ip_acl covers the case where activations are f32 and weights are bf16 (f32:bf16:bf16 or f32:bf16:f32). I suggest adding it to test_ip_acl to make sure it is validated for ACL. Another option is to run tests/benchdnn/inputs/ip/test_ip_bfloat16 on Aarch64 with ACL enabled, because it does have such cases:

$ grep -r "f32:bf16" tests/benchdnn/inputs/ip/
tests/benchdnn/inputs/ip/test_ip_bfloat16:--dt=bf16,f32:bf16:bf16
tests/benchdnn/inputs/ip/test_ip_bfloat16:--dt=bf16,bf16:f32:bf16
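For illustration, such mixed-precision cases could be added to test_ip_acl with batch-file entries along these lines. The --dt syntax mirrors the test_ip_bfloat16 lines above; the problem shape here is hypothetical:

```
--reset
--dt=f32:bf16:f32,f32:bf16:bf16
mb32ic512oc128
```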

UPDATE: This is a backport! Please ignore the comments about additional changes; such changes should be made against main for future releases, not in the backport. The changes look good to me.

@snadampal
Contributor Author

Thanks, @igorsafo, I will look into adding those additional test cases to main.

@vpirogov vpirogov merged commit e7abee2 into uxlfoundation:rls-v3.3 Mar 13, 2024
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Mar 28, 2024
As the title. Including issue fixes for aarch64:
- uxlfoundation/oneDNN#1831
- uxlfoundation/oneDNN#1834

---

## Validation results
(on Intel CPU + Linux)
**Static quantization with Inductor on CV models**

Quant method | Geomean throughput ratio (v3.3.6/baseline)
-- | --
ptq | 0.982937
ptq (cpp wrapper) | 0.978384
qat | 0.978828

**Torchbench cpu userbenchmark with Inductor**

Items | Perf Geomean Ratio (v3.3.6/baseline)
-- | --
eager_throughtput_bf16_infer | 1.00x
eager_throughtput_fp32_infer | 1.00x
jit_llga_throughtput_amp_bf16 | 1.01x
jit_llga_throughtput_fp32 | 1.00x
eager_throughtput_fx_int8 | 1.00x
eager_throughtput_bf16_train | 1.46x
eager_throughtput_fp32_train | 1.41x

**Dynamo benchmarks tests**
Precision | Shape | Wrapper | Thread | Eager old/new GEOMEAN | Inductor old/new GEOMEAN
-- | -- | -- | -- | -- | --
Float32 | Static | Default | Multiple | 1.003836812 | 1.003425
Float32 | Static | Default | Single | 1.000181451 | 0.999611
Float32 | Dynamic | Default | Multiple | 1.003980183 | 1.006563
Float32 | Dynamic | Default | Single | 1.000076939 | 0.999969
AMP | Static | Default | Multiple | 0.996824772 | 0.998715
AMP | Static | Default | Single | 0.996402574 | 1.001483
AMP | Dynamic | Default | Multiple | 0.994919866 | 1.000467
AMP | Dynamic | Default | Single | 0.9962054 | 1.000767

(on Aarch64)
#122164 (comment)

---

Pull Request resolved: #122164
Approved by: https://github.com/snadampal, https://github.com/malfet, https://github.com/atalman
Xia-Weiwen added a commit to Xia-Weiwen/pytorch that referenced this pull request Mar 29, 2024
atalman pushed a commit to pytorch/pytorch that referenced this pull request Apr 2, 2024
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024