THCReduce noncontigdim kernel improvements. #751
csarofeen wants to merge 1 commit into torch:master from csarofeen:reduction
Conversation
Added extra kernel and heuristics to improve smaller tensor reductions.
for (IndexType i = 0; i < reductionSize; ++i) {
  r = reduceOp(r, modifyOp(in.data[inOffset]));
  inOffset += reductionStride;
__syncthreads();
It is unfortunately necessary. We're trying to prevent warps from getting too far ahead which will have negative effects on the memory system.
How can they get too far from each other? If the ops have uneven branches?
}else{
  //x dim does different slices
  //y dim helps with a slice
  //If we only have 8 loops, don't bother sharing work across ydim
Yes, I'll fix this comment.
if (!getNoncontigReduceGrid(outElements, grid)) {
  return false;

//If there are a large number of outputs to the reduction, avoid syncthreads
Both kernels have syncthreads right now
Yes, I'll fix this comment.
long gridx = THCCeilDiv( outElements, (long)block.x);
if (gridx > 1024){
  long n_loops = THCCeilDiv(outElements, (long) (1024 * block.x) );
  gridx = outElements / (block.x*n_loops);
Are you sure this is ok? If you remove the ceil it is equivalent to setting gridx to 1024.
I will review this again to make sure it is correct. It's mainly for load balancing the internal slice loop.
__device__ __forceinline__ IndexType getReduceNoncontigDimSliceIndex() {
  // Each thread handles one slice
  return getLinearBlockId<IndexType>() * THC_NONCONTIG_REDUCE_BLOCK_SIZE + threadIdx.x;
#define LOCAL_MAX_BLOCK_SIZE 512
It seems that this constant is used for shared mem size, but is not used when computing the block size. Is that ok?
https://github.com/csarofeen/cutorch/blob/master/lib/THC/THCReduce.cuh#L239-L244
It ensures block size = 512. The name was a bit of a misnomer, as I enforced 512 instead of treating it as a max.
I know it enforces it, but I think it would be better to use the constant in both places. Otherwise there's no point in separating it from the code, because the two can get out of sync.
*shmem = reduceOp(*shmem, *(shmem + blockDim.x * i) );
}
out.data[outOffset] = *shmem;
}
Why is this just limited to groupID == 0? Wouldn't reducing to half the groups at each step be faster?
It might be. I could actually try it, as I forced blockDim.y to be a multiple of 2, so the logic shouldn't be too bad. Will check.
@apaszke I think @ngimel meant that we could use expand + add instead of fill + addr.

@fmassa I know what's the deal with expand+add vs fill+addr, I'm just asking how it's related to this change. I don't know why I thought that expanded tensors are contiguous, nvm.
killeent left a comment
What is the test plan for this? Do we have some benchmarking that shows this is faster?
T init,
ModifyOp modifyOp,
ReduceOp reduceOp) {
IndexType threadLane = threadIdx.x;
threadLane seems like a bit of a misnomer here. I'm not sure how this corresponds to the lane in the warp.
You're correct; it was a remnant from when I was using a 1-D block. I'll name it something more appropriate.
IndexType threadLane = threadIdx.x;
IndexType groupID = threadIdx.y;
IndexType sliceIndex = blockIdx.x * blockDim.x + threadLane;
IndexType sliceStride = gridDim.x * blockDim.x;
Similarly, sliceStride is a bit confusing - this is actually the stride with which to get the next slice for reduction, but the variable name makes it sound like the stride for elements within a slice.
Do you have a suggestion on this name?
IndexType stride = reductionStride * blockDim.y;

for(IndexType i=groupID; i<reductionSize; i+=blockDim.y){
  (*shmem) = reduceOp(*shmem, modifyOp(in.data[inOffset]) );
I'm not sure exactly how this works. I could be wrong, but aren't we hitting shared memory every time here? If we want different threads in the "group" to reduce things in registers wouldn't we need a local variable?
Will check, but the compiler tends to optimize it into registers (this is why there's a shared-mem volatile flag).
https://gist.github.com/csarofeen/80e8e567d49e3a2511d6bcd7bd891a98

Still working on this.
THCReduce noncontigdim kernel improvements. Added extra kernel and heuristics to improve smaller tensor reductions.