Closed
Labels: module: reductions, triaged
Description
Benchmark script:

```python
import torch
import time

B, nmodel, emb = 6, 4, 2400
word_size = 4
a = torch.rand(B, nmodel, emb)
NITER = 100

s = time.time()
for i in range(NITER):
    torch.mean(a, dim=1)
elapsed = (time.time() - s) / NITER
print('Time per iteration', elapsed * 1000, 'ms/iter')
print('memory bandwidth', (B * nmodel * emb + B * emb) * word_size / elapsed / 1e9, 'GB/s')
```
Today this yields (on a Skylake CPU, as well as on other machines we've tried):

```
Time per iteration 35.26966094970703 ms/iter
memory bandwidth 0.00816565830929521 GB/s
```
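For reference, the bandwidth figure reported by the script follows directly from the bytes read and written per call (this just restates the formula in the benchmark above):

```python
# Bytes moved per torch.mean(a, dim=1) call:
# read the (B, nmodel, emb) float32 input, write the (B, emb) float32 output.
B, nmodel, emb, word_size = 6, 4, 2400, 4
bytes_per_iter = (B * nmodel * emb + B * emb) * word_size
print(bytes_per_iter)  # 288000 bytes, i.e. under 0.3 MB per call
# At 35.27 ms per iteration: 288000 / 0.03527 / 1e9 ~= 0.0082 GB/s,
# orders of magnitude below what a Skylake-class memory system can sustain.
```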
This is well below the memory bandwidth the chip can sustain. Upon further inspection, we see that the mean operator falls back to the unvectorized reduction path, `binary_kernel_reduce(...)`, while the sum implementation directly above it uses the vectorized reduction.
In an upcoming PR (#16617), I work around this by having the mean implementation call into sum() followed by a scalar division, which yields the following performance with the same benchmark script:
```
Time per iteration 0.5928611755371094 ms/iter
memory bandwidth 0.485779828201911 GB/s
```
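The workaround can be sketched in Python as follows. This is a minimal illustration of the sum-then-divide idea; `mean_via_sum` is a hypothetical helper name, and the actual PR applies the change inside the C++ reduction kernels, not at the Python level:

```python
import torch

def mean_via_sum(t: torch.Tensor, dim: int) -> torch.Tensor:
    # Hypothetical sketch: use the vectorized sum reduction,
    # then divide by the length of the reduced dimension.
    return torch.sum(t, dim=dim) / t.shape[dim]

a = torch.rand(6, 4, 2400)
# Numerically matches the built-in mean (up to floating-point rounding).
assert torch.allclose(mean_via_sum(a, 1), torch.mean(a, dim=1))
```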
As a long-term fix, we should vectorize the actual mean implementation itself.