Closed
Labels: module: reductions, triaged
Description
Benchmark script:

```python
import torch
import time

B, nmodel, emb = 6, 4, 2400
word_size = 4
a = torch.rand(B, nmodel, emb)
NITER = 100

s = time.time()
for i in range(NITER):
    torch.mean(a, dim=1)
elapsed = (time.time() - s) / NITER
print('Time per iteration', elapsed * 1000, 'ms/iter')
print('memory bandwidth', (B * nmodel * emb + B * emb) * word_size / elapsed / 1e9, 'GB/s')
```
Today this yields (on a Skylake CPU, as well as on other machines we've tried):

```
Time per iteration 35.26966094970703 ms/iter
memory bandwidth 0.00816565830929521 GB/s
```
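For reference, the bandwidth figure reported by the script follows directly from the bytes read and written per call (this just restates the formula in the benchmark above):

```python
# Bytes moved per torch.mean(a, dim=1) call:
# read the (B, nmodel, emb) float32 input, write the (B, emb) float32 output.
B, nmodel, emb, word_size = 6, 4, 2400, 4
bytes_per_iter = (B * nmodel * emb + B * emb) * word_size
print(bytes_per_iter)  # 288000 bytes, i.e. under 0.3 MB per call
# At 35.27 ms per iteration: 288000 / 0.03527 / 1e9 ~= 0.0082 GB/s,
# orders of magnitude below what a Skylake-class memory system can sustain.
```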
This is well below the memory bandwidth the chip can sustain. Upon further inspection, we see that the mean operator falls back to the unvectorized reduction path, `binary_kernel_reduce(...)`, while the sum implementation directly above it uses the vectorized reduction.
In an upcoming PR (#16617), I work around this by having the mean implementation call into sum() followed by a scalar division, which yields the following performance with the same benchmark script:
```
Time per iteration 0.5928611755371094 ms/iter
memory bandwidth 0.485779828201911 GB/s
```
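The workaround can be sketched in Python as follows. This is a minimal illustration of the sum-then-divide idea; `mean_via_sum` is a hypothetical helper name, and the actual PR applies the change inside the C++ reduction kernels, not at the Python level:

```python
import torch

def mean_via_sum(t: torch.Tensor, dim: int) -> torch.Tensor:
    # Hypothetical sketch: use the vectorized sum reduction,
    # then divide by the length of the reduced dimension.
    return torch.sum(t, dim=dim) / t.shape[dim]

a = torch.rand(6, 4, 2400)
# Numerically matches the built-in mean (up to floating-point rounding).
assert torch.allclose(mean_via_sum(a, 1), torch.mean(a, dim=1))
```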
As a long-term fix, we should vectorize the actual mean implementation itself.