Target 4096 blocks instead of splitting to a large grid for large reductions (#35997)
Summary:
Pull Request resolved: #35997
When the number of blocks is large enough, we already achieve
balanced SM allocation. But we should still keep the number of inputs
per thread large, because thread-level reduction is cheap.
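The heuristic above can be sketched as follows. This is a minimal illustration, not PyTorch's actual TensorIterator reduction code: instead of launching one block per chunk of input, the grid is capped at a fixed target (4096, per the PR title) so each thread reduces many inputs in a grid-stride fashion. The helper name, the block size, and the tuple it returns are all assumptions for illustration.

```python
MAX_BLOCKS = 4096          # target grid size (from the PR title; assumption)
THREADS_PER_BLOCK = 256    # illustrative block size, not PyTorch's choice

def launch_config(num_elements: int) -> tuple[int, int]:
    """Return (blocks, inputs_per_thread) for a reduction of num_elements.

    Hypothetical sketch: cap the grid at MAX_BLOCKS so that, for large
    inputs, each thread keeps many elements to reduce serially.
    """
    # Naive grid: one element per thread.
    naive_blocks = (num_elements + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK
    # Cap the grid instead of splitting to an arbitrarily large one.
    blocks = min(naive_blocks, MAX_BLOCKS)
    total_threads = blocks * THREADS_PER_BLOCK
    # Each thread now reduces this many inputs (cheap serial reduction).
    inputs_per_thread = (num_elements + total_threads - 1) // total_threads
    return blocks, inputs_per_thread
```

For example, a 2^24-element reduction would launch 4096 blocks with each thread reducing 16 inputs, rather than 65536 blocks with one input per thread; small inputs are unaffected by the cap.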
Benchmark for Half on V100:
https://github.com/zasdfgbnm/things/blob/master/2020Q2/reduction-benchmark.ipynb
On large tensors: 1.37 ms vs 1.25 ms
Test Plan: Imported from OSS
Differential Revision: D20927533
Pulled By: ngimel
fbshipit-source-id: 40df52e439cc1c01cda66c6195b600f301c5e984