[ATen] mean operator is unvectorized on CPU #16617

@jamesr66a

Description

Benchmark script:

import torch
import time

B, nmodel, emb = 6, 4, 2400    # batch, number of models, embedding dim
word_size = 4                  # bytes per float32 element
a = torch.rand(B, nmodel, emb)

NITER = 100

s = time.time()
for i in range(NITER):
    torch.mean(a, dim=1)
elapsed = (time.time() - s) / NITER
print('Time per iteration', elapsed * 1000, 'ms/iter')
# Effective bandwidth: one read of the input plus one write of the output.
print('memory bandwidth', (B*nmodel*emb + B*emb) * word_size / elapsed / 1e9, 'GB/s')

Today this yields (on a Skylake CPU, as well as other machines we've tried):

Time per iteration 35.26966094970703 ms/iter
memory bandwidth 0.00816565830929521 GB/s
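
For reference, the bandwidth figure in the script counts one read of the input tensor plus one write of the output, at 4 bytes per float32 element. A quick sanity check of the byte count per iteration:

```python
# Bytes moved per iteration, as counted by the benchmark above:
# read the (B, nmodel, emb) input once, write the (B, emb) output once.
B, nmodel, emb, word_size = 6, 4, 2400, 4
bytes_moved = (B * nmodel * emb + B * emb) * word_size
print(bytes_moved, 'bytes per iteration')  # 288000 bytes
```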

This is well below the memory bandwidth the chip can sustain. Upon further inspection, we see that the mean operator uses the unvectorized reduction implementation, while the sum implementation directly above it uses the vectorized one:

In an upcoming PR (#16617), I am working around this by having mean call into sum() and then divide by a scalar, which yields the following performance with the same benchmark script:

Time per iteration 0.5928611755371094 ms/iter
memory bandwidth 0.485779828201911 GB/s

As a long-term solution, we should vectorize the actual mean implementation.
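
The workaround amounts to computing the sum along the reduced dimension and then dividing by its size. A minimal pure-Python sketch of that equivalence (nested lists stand in for a tensor so the example has no dependencies; `mean_dim1` is a hypothetical helper name, not the actual ATen code):

```python
# Sketch of the workaround: mean along dim=1 computed as sum-then-scale.
# The real fix dispatches to the vectorized sum kernel and then performs
# a single scalar division; here we just show the numerical equivalence.

def mean_dim1(t):
    """Reduce a [B][N][E] nested list over dim 1 via sum followed by / N."""
    B, N, E = len(t), len(t[0]), len(t[0][0])
    out = []
    for b in range(B):
        row = [0.0] * E
        for n in range(N):                    # the reduction dimension
            for e in range(E):
                row[e] += t[b][n][e]
        out.append([v / N for v in row])      # divide by the dim-1 size
    return out

t = [[[1.0, 2.0], [3.0, 4.0]]]               # shape (B=1, N=2, E=2)
print(mean_dim1(t))                          # [[2.0, 3.0]]
```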

Metadata

Labels

module: reductions, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
