### 🐛 Describe the bug
On the aarch64 Linux platform, PyTorch inference latency is higher with torch 2.1 and 2.2 than with torch 2.0 when the OpenBLAS backend is used in a multi-threaded configuration. The regression grows with the thread count.
On AWS Graviton3 (c7g.4xlarge) with 16 threads, the inference latency with torch 2.0 is

```
Time elapsed: 2.777902126312256 seconds
```

whereas with torch 2.1 and later it is

```
Time elapsed: 4.907686471939087 seconds
```
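For reference, a quick way to confirm the thread count in effect and which BLAS backend a given wheel was built against (a sketch using standard torch introspection; `torch.__config__.show()` prints the build configuration, which names the BLAS library):

```python
import torch

# Report the torch version, the intra-op thread count in effect,
# and the build configuration (which names the BLAS backend,
# e.g. OpenBLAS on aarch64 wheels).
print(torch.__version__)
print(torch.get_num_threads())
print(torch.__config__.show())
```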
Reproducer:

```bash
pip3 install torch
pip3 install transformers
OMP_NUM_THREADS=16 python3 gpt2-large.py
```
`gpt2-large.py`:

```python
import time

from transformers import pipeline, set_seed

# Build a text-generation pipeline around GPT-2 large (CPU inference).
generator = pipeline('text-generation', model='gpt2-large')
set_seed(42)

# Time a single generation of up to 40 tokens.
start_time = time.time()
generator("Hello, I'm a language model", max_length=40, num_return_sequences=1)
end_time = time.time()

elapsed_time = end_time - start_time
print(f"Time elapsed: {elapsed_time} seconds")
```
### Versions

aarch64 Linux, Ubuntu 22.04, torch 2.1 or later
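If a fuller environment report helps triage, PyTorch's standard collect_env utility can generate one:

```python
# Print the standard PyTorch environment report (OS, compiler,
# library versions) commonly attached to bug reports.
from torch.utils.collect_env import get_pretty_env_info

print(get_pretty_env_info())
```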