### 🐛 Describe the bug
On the aarch64 Linux platform, PyTorch inference latency is higher with torch 2.1 and 2.2 than with torch 2.0 when the OpenBLAS backend is used in a multi-threaded configuration. The regression grows with the thread count.
On AWS Graviton3 (c7g.4xlarge) with 16 threads, the inference latency with torch 2.0 is

```
Time elapsed: 2.777902126312256 seconds
```

whereas with torch 2.1 and later it is

```
Time elapsed: 4.907686471939087 seconds
```
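For reference, a quick way to confirm the thread count in effect and which BLAS backend a given wheel was built against (a sketch using standard torch introspection; `torch.__config__.show()` prints the build configuration, which names the BLAS library):

```python
import torch

# Report the torch version, the intra-op thread count in effect,
# and the build configuration (which names the BLAS backend,
# e.g. OpenBLAS on aarch64 wheels).
print(torch.__version__)
print(torch.get_num_threads())
print(torch.__config__.show())
```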
Reproducer:

```bash
pip3 install torch
pip3 install transformers
OMP_NUM_THREADS=16 python3 gpt2-large.py
```
`gpt2-large.py`:

```python
import time

from transformers import pipeline, set_seed

# Build a text-generation pipeline around GPT-2 large (CPU inference).
generator = pipeline('text-generation', model='gpt2-large')
set_seed(42)

# Time a single generation of up to 40 tokens.
start_time = time.time()
generator("Hello, I'm a language model", max_length=40, num_return_sequences=1)
end_time = time.time()

elapsed_time = end_time - start_time
print(f"Time elapsed: {elapsed_time} seconds")
```
### Versions

aarch64 Linux, Ubuntu 22.04, torch 2.1 or later
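If a fuller environment report helps triage, PyTorch's standard collect_env utility can generate one:

```python
# Print the standard PyTorch environment report (OS, compiler,
# library versions) commonly attached to bug reports.
from torch.utils.collect_env import get_pretty_env_info

print(get_pretty_env_info())
```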