Use fast integer division algorithm to avoid division ops inside kernels.#5054
Use fast integer division algorithm to avoid division ops inside kernels.#5054soumith merged 5 commits intopytorch:masterfrom
Conversation
- OffsetInfo and OffsetIterator pre-computes the necessary coordinate
change along each dimension, so that each successive offset can be
computed using only addition/subtraction/comparisons.
- Added IntDivider which supports "magic division" for uint32_t, thus
eliminating integer divisions altogether for offset calculation, as
long as indices fit in 32 bits.
- In code paths with statically determined dimensions (Dims=1 or 2),
kernel arguments now contain only the necessary data (instead of
MAX_CUTORCH_DIMS of everything).
- Fixed index overflow errors: for tensors with >= 2G elements, we used
to have incorrect results or an infinite loop inside the kernel.
TODO: The following pattern is broken for tensors with >= 2G elements.
It will result in overflow, even if IndexType is uint64_t. Need
to search and replace them.
> for (IndexType linearIndex = blockIdx.x * blockDim.x + threadIdx.x;
> linearIndex < totalElements;
> linearIndex += gridDim.x * blockDim.x) {
|
This PR improves some float operations by ~20% (and some operations on ByteTensor by up to ~45%), but in general the performance impact seems small, unless one uses a lot of non-contiguous tensors and/or broadcasting with large dimensions. Here's an example where I could get ~20% improvement on GTX 1080: I found at least one case where it becomes slower by ~5%, but such cases seem to be rare, so I still think it's a net performance win on average, although small. Raw benchmark results are https://github.com/yongjik/pt_test/tree/master/results/offset in case anybody's interested.
|
|
@pytorchbot add to whitelist |
wickedfoo
left a comment
There was a problem hiding this comment.
I'm not convinced that the code as you have it results in a performance win, and it makes the code a lot more complicated. 7.6 us to 6.0 us is within the realm of noise, and such changes are sensitive to heuristics used in the register allocator and in other places.
Replacing the linear index with a per-dimension index will bloat out the register count, and the code within the new iteration stuff looks like it has divergent/predicated execution paths as well.
However, I do believe that constant integer division via multiplication/shift by constants is worth trying. Your magic number division algorithm can be simplified by restricting its usage to the case 2 to max signed int (see comments).
Can you do a more minimal diff keeping the old kernel structure and the linear index -> offset lookup trying the faster version of the magic constant division algorithm, with a fallback to using normal integer div/mod if it falls outside the range under consideration?
For performance testing, I would concentrate on sufficiently large tensor sizes, say a large tensor (multi-100 MB+ in size) that is transposed on which you perform pointwise operations. A kernel that executes in just microseconds I think is likely to fall within the margin of noise.
Also I would inspect the SASS to see what instructions it was emitting before for integer div/mod (I believe it tries to map it to floating point inverse, when I recall looking a long time ago), and see what instructions it actually issues for umulhi as well.
| #ifdef __CUDA_ARCH__ | ||
| // 't' is the higher 32-bits of unsigned 32-bit multiplication of 'n' and | ||
| // 'm1'. | ||
| unsigned int t = __umulhi(n, m1); |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
| // 't' is the higher 32-bits of unsigned 32-bit multiplication of 'n' and | ||
| // 'm1'. | ||
| unsigned int t = __umulhi(n, m1); | ||
| unsigned int t2 = t + ((n - t) >> s1); |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
| { | ||
| bool carry = false; | ||
|
|
||
| for (int i = dims - 1; i > 0; --i) { |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
| bool carry = false; | ||
|
|
||
| for (int i = dims - 1; i > 0; --i) { | ||
| IndexType index = indices[i] + increments[i] + (IndexType) carry; |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
| @@ -119,17 +120,18 @@ template <typename Op, | |||
| __launch_bounds__(THC_APPLY_THREADS_PER_BLOCK, THC_APPLY_BLOCKS_PER_SM) | |||
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
| IndexType next = index + step; | ||
|
|
||
| // The second condition is necessary to handle overflow (e.g., when step is | ||
| // 2GB and limit is 3GB, assuming 32-bit index). |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
| const OffsetInfo<Tb, IndexType, BDims> b, | ||
| IndexType totalElements, | ||
| Op op) { | ||
| for (IndexType linearIndex = blockIdx.x * blockDim.x + threadIdx.x; |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
|
Hi @wickedfoo, thanks for the detailed review, and I understand your point that the code is too complicated for the (rather unimpressive) speedup. I'll try just using the constant division algorithm and get back to you. Might take a few days. On the other hand, I do think there's a measurable speedup for some cases. One case I found: Ironically, using even larger tensor doesn't show larger speedup, because then (I suppose) memory bandwidth dominates everything. |
|
@yongjik also take a look at https://github.com/milakov/int_fastdiv |
|
Also the reason that IndexType was |
|
The integer division by magic constants code in the Caffe2 source I think will be faster than int_fastdiv if you exclude the -1 / 1 case. They're basically the same code more or less, except you avoid this additional work: https://github.com/milakov/int_fastdiv/blob/master/int_fastdiv.h#L126 |
|
@yongjik I suffered a lot tuning the |
- Also changed canUse32BitIndexMath so that the max index for 32-bit math is INT32_MAX, instead of UINT32_MAX. It also simplifies the division operation.
|
Hi @wickedfoo, I updated the code to remove the increment stuff and only leave the int division algorithm. Could you take another look? Regarding signed/unsigned integer, I think the point is moot, because (in the references I found) the fast division algorithm for signed integers always has more operations than the unsigned version. So I think they don't really give us any benefit here. |
|
Hi guys, any thoughts on this PR? |
|
Hi @wickedfoo, could you give your opinion? If this PR still looks like too much complication, I understand if you don't want to merge this, but I'd appreciate a decision rather than this PR staying in limbo forever. Thanks! |
|
@yongjik i think he does not get github notification emails. I will ping him directly. sorry for delay. |
|
Looking now. |
wickedfoo
left a comment
There was a problem hiding this comment.
Looks good to me. Any idea what the performance change of this is (i.e., is it worth it, and for what sizes)?
| __host__ __device__ T* get(IndexType linearIndex) const { | ||
| IndexType offset = 0; | ||
|
|
||
| for (int i = tinfo.dims - 1; i > 0; --i) { |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
| @@ -0,0 +1,89 @@ | |||
| #ifndef THC_OFFSET_INFO_INC | |||
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
|
Hi @wickedfoo, thanks for the review. I ran several hundred configurations of tensor operations (on GTX 1080 / CUDA 9.1), including The biggest win I could find was: We also have speedup for float operations, though not as dramatic: For some other float operations, I observed speedup of ~25% for mid-size tensors (around 1000x128), but it becomes smaller as tensors get bigger (~9% for 1024x1024, ~3% for 8000x3000), probably because memory latency dominates everything for these tensors. |
|
I don't know if I'm doing it right, but I followed the advice of the failed test log and ran On clean branch (2726550, ran three times): With this PR on top of it: So I think there's no meaningful difference on GTX-1080, but other GPUs might report different numbers, I guess. |
|
The GPU perf tests have been flaky recently, so you should ignore them for the purposes of assessing this PR. |
|
thanks @yongjik. sorry for the delay in review. |
|
No worries! Half of the delay was mine, after all. Thanks for the review. |
…els. (pytorch#5054) * Use pre-computed offset increments to avoid int division inside kernels. - OffsetInfo and OffsetIterator pre-computes the necessary coordinate change along each dimension, so that each successive offset can be computed using only addition/subtraction/comparisons. - Added IntDivider which supports "magic division" for uint32_t, thus eliminating integer divisions altogether for offset calculation, as long as indices fit in 32 bits. - In code paths with statically determined dimensions (Dims=1 or 2), kernel arguments now contain only the necessary data (instead of MAX_CUTORCH_DIMS of everything). - Fixed index overflow errors: for tensors with >= 2G elements, we used to have incorrect results or an infinite loop inside the kernel. TODO: The following pattern is broken for tensors with >= 2G elements. It will result in overflow, even if IndexType is uint64_t. Need to search and replace them. > for (IndexType linearIndex = blockIdx.x * blockDim.x + threadIdx.x; > linearIndex < totalElements; > linearIndex += gridDim.x * blockDim.x) { * Update CMakeLists.txt * Removed OffsetIterator, and kept only the fast integer division logic. - Also changed canUse32BitIndexMath so that the max index for 32-bit math is INT32_MAX, instead of UINT32_MAX. It also simplifies the division operation. * Merged OffsetInfo into THCTensorInfo.cuh.
OffsetInfo and OffsetIterator pre-computes the necessary coordinate
change along each dimension, so that each successive offset can be
computed using only addition/subtraction/comparisons.
Added IntDivider which supports "magic division" for uint32_t, thus
eliminating integer divisions altogether for offset calculation, as
long as indices fit in 32 bits.
In code paths with statically determined dimensions (Dims=1 or 2),
kernel arguments now contain only the necessary data (instead of
MAX_CUTORCH_DIMS of everything).
Fixed index overflow errors: for tensors with >= 2G elements, we used
to have incorrect results or an infinite loop inside the kernel.
TODO: The following pattern is broken for tensors with >= 2G elements.
It will result in overflow, even if IndexType is uint64_t. Need
to search and replace them.