parallel norm operation for ATen on CPU #10535
Conversation
cc @colesbury, as I believe the TensorIterator reduction work can also be used for norm computation
@xhzhao TensorIterator with reduction is not yet merged into master, but it will allow more generic reductions on contiguous and non-contiguous tensors, possibly over multiple dimensions. I mentioned the TensorIterator to Sam so that we keep the
Maybe we can review this PR first, and update the norm operation after the TensorIterator PR is merged.
```cpp
static scalar_t norm_calc(const scalar_t* data, int64_t n, int64_t stride, float pval) {
  scalar_t result = 0.0;
```
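For context, a self-contained sketch of what a strided p-norm kernel along these lines computes: result = (Σᵢ |data[i·stride]|^p)^(1/p). The body below is my own illustrative completion, not the PR's actual implementation, and it does not handle special cases such as p = 0 or p = ∞.

```cpp
#include <cmath>
#include <cstdint>

// Illustrative strided p-norm reduction (hypothetical completion of the
// excerpted signature): accumulates |x|^p over n strided elements, then
// takes the 1/p root.
template <typename scalar_t>
static scalar_t norm_calc(const scalar_t* data, int64_t n, int64_t stride, float pval) {
  scalar_t result = 0.0;
  for (int64_t i = 0; i < n; ++i) {
    result += std::pow(std::abs(data[i * stride]), static_cast<scalar_t>(pval));
  }
  return std::pow(result, static_cast<scalar_t>(1.0f / pval));
}
```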
Merged in #11565
Optimize the norm operation for the ATen CPU path.
norm is a very heavy operation in RNN-related workloads; see the OpenNMT-py example.
Our profiling shows that norm takes about 8% of OpenNMT-py training time, which is not acceptable.
Currently, the code path from the TH module runs sequentially on CPU, see link.
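The general idea of parallelizing such a reduction can be sketched as follows: split the input into per-thread chunks, accumulate a partial sum of |x|^p per chunk, and combine. This is a minimal illustration using `std::thread`; the actual PR uses ATen/TH's internal threading, not shown here.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <thread>
#include <vector>

// Hedged sketch of a parallel p-norm: each thread reduces its own chunk
// into a private slot (no data race), then partials are combined serially.
double parallel_norm(const std::vector<double>& data, double p, int num_threads = 4) {
  int64_t n = static_cast<int64_t>(data.size());
  std::vector<double> partial(num_threads, 0.0);
  std::vector<std::thread> workers;
  int64_t chunk = (n + num_threads - 1) / num_threads;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&, t] {
      int64_t begin = static_cast<int64_t>(t) * chunk;
      int64_t end = std::min(n, begin + chunk);
      double acc = 0.0;
      for (int64_t i = begin; i < end; ++i)
        acc += std::pow(std::abs(data[i]), p);
      partial[t] = acc;  // each thread writes only its own slot
    });
  }
  for (auto& w : workers) w.join();
  double total = 0.0;
  for (double s : partial) total += s;
  return std::pow(total, 1.0 / p);
}
```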
Norm performance comparison before and after our optimization: