
THCReduce noncontigdim kernel improvements. #751

Closed
csarofeen wants to merge 1 commit into torch:master from csarofeen:reduction

Conversation

@csarofeen
Contributor

THCReduce noncontigdim kernel improvements. Added an extra kernel and heuristics to improve smaller tensor reductions.

@ngimel

ngimel commented Apr 18, 2017

@soumith, @apaszke This would allow switching to expand_as/sum (instead of addr/gemv) when adding bias in linear functions, with performance gains, especially for smaller linear sizes.
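[Editor's sketch, not part of the original thread.] The equivalence ngimel is pointing at can be seen in a few lines of plain Python: the bias gradient of a Linear layer is a reduction of grad_output over the batch dimension, which can be computed either as a gemv against a ones-vector (the addr/gemv path) or as a plain sum (the expand_as/sum path). Shapes and values here are illustrative only.

```python
# grad_output for a Linear layer, shape (batch=4, out_features=3)
grad_output = [[float(3 * i + j) for j in range(3)] for i in range(4)]

# gemv style: ones(batch) @ grad_output, as in the fill+addr/gemv path
ones = [1.0] * 4
grad_bias_gemv = [sum(ones[i] * grad_output[i][j] for i in range(4))
                  for j in range(3)]

# sum style: reduce directly over the batch dimension
grad_bias_sum = [sum(grad_output[i][j] for i in range(4)) for j in range(3)]

assert grad_bias_gemv == grad_bias_sum
```

Both paths produce the same bias gradient; the point of the PR is that the sum-based reduction can now be faster for small sizes.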

Contributor

@apaszke apaszke left a comment

@ngimel how does it help with bias addition? The output of mm and the bias are always contiguous, and this PR only changes the noncontig kernels.

for (IndexType i = 0; i < reductionSize; ++i) {
  r = reduceOp(r, modifyOp(in.data[inOffset]));
  inOffset += reductionStride;
  __syncthreads();
Contributor

Is this necessary?

Contributor Author

It is unfortunately necessary. We're trying to prevent warps from getting too far ahead, which would have negative effects on the memory system.

Contributor

How can they get too far ahead of each other? If the ops have uneven branches?

} else {
  // x dim does different slices
  // y dim helps with a slice
  // If we only have 8 loops, don't bother sharing work across ydim
Contributor

Should it be 16 loops?

Contributor Author

Yes, I'll fix this comment.

if (!getNoncontigReduceGrid(outElements, grid)) {
  return false;

// If there are a large number of outputs to the reduction, avoid syncthreads
Contributor

Both kernels have syncthreads right now

Contributor Author

Yes, I'll fix this comment.

long gridx = THCCeilDiv(outElements, (long)block.x);
if (gridx > 1024) {
  long n_loops = THCCeilDiv(outElements, (long)(1024 * block.x));
  gridx = outElements / (block.x * n_loops);
Contributor

Are you sure this is ok? If you remove the ceil it is equivalent to setting gridx to 1024.

Contributor Author

I will review this again to make sure it is correct. It's mainly there to load-balance the internal slice loop.
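[Editor's sketch, not part of the original thread.] The heuristic under discussion can be restated in plain Python (function and parameter names here are mine, not from the patch): instead of clamping gridx at 1024, it first computes how many loop iterations each thread would need at the cap, then recomputes gridx from that loop count so the per-thread work divides evenly. The result is close to, but not always exactly, 1024, which is what the reviewer is probing.

```python
def ceil_div(a, b):
    # Integer ceiling division, as THCCeilDiv does
    return (a + b - 1) // b

def noncontig_grid_x(out_elements, block_x, max_grid_x=1024):
    # One block-column per output element group; cap the grid, then
    # rebalance gridx against the number of per-thread loop iterations.
    gridx = ceil_div(out_elements, block_x)
    if gridx > max_grid_x:
        n_loops = ceil_div(out_elements, max_grid_x * block_x)
        gridx = out_elements // (block_x * n_loops)
    return gridx
```

For example, with 1,000,000 output elements and block_x = 32 this yields gridx = 1008 rather than a flat clamp to 1024, so each of the 1008 block-columns runs the same number of slice loops.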

__device__ __forceinline__ IndexType getReduceNoncontigDimSliceIndex() {
  // Each thread handles one slice
  return getLinearBlockId<IndexType>() * THC_NONCONTIG_REDUCE_BLOCK_SIZE + threadIdx.x;

#define LOCAL_MAX_BLOCK_SIZE 512
Contributor

It seems that this constant is used for shared mem size, but is not used when computing the block size. Is that ok?

Contributor Author

https://github.com/csarofeen/cutorch/blob/master/lib/THC/THCReduce.cuh#L239-L244
This ensures block size = 512. The name was a bit of a misnomer, as I enforced 512 instead of treating it as a max.

Contributor

@apaszke apaszke Apr 19, 2017

I know it enforces it, but I think it would be better to use the constant in both places. Otherwise there's no point in separating it from the code, because the two can get out of sync.

    *shmem = reduceOp(*shmem, *(shmem + blockDim.x * i));
  }
  out.data[outOffset] = *shmem;
}
Contributor

Why is this just limited to groupID == 0? Wouldn't reducing half the groups at each step be faster?

Contributor Author

It might be. I could actually try it, as I forced blockDim.y to be a multiple of 2, so the logic shouldn't be too bad. Will check.
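[Editor's sketch, not part of the original thread.] The two strategies being compared can be simulated in plain Python over the per-group partial results. The serial version is what the kernel does now (group 0 walks everyone else's partial); the tree version is the reviewer's suggestion, halving the number of active groups each step, so it takes log2(groups) steps instead of groups - 1. The tree version assumes the group count is a power of 2. Names are illustrative.

```python
def serial_final_reduce(partials, reduce_op):
    # Current kernel behavior: one group accumulates all other partials
    acc = partials[0]
    for p in partials[1:]:
        acc = reduce_op(acc, p)
    return acc

def tree_final_reduce(partials, reduce_op):
    # Suggested alternative: halve the active groups each step
    # (requires len(partials) to be a power of 2)
    partials = list(partials)
    stride = len(partials) // 2
    while stride > 0:
        for i in range(stride):
            partials[i] = reduce_op(partials[i], partials[i + stride])
        stride //= 2
    return partials[0]
```

Both produce the same result for any associative, commutative reduce op; the tree variant just exposes more parallelism across the y groups.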

@fmassa
Contributor

fmassa commented Apr 19, 2017

@apaszke I think @ngimel meant that we could use expand + add instead of addr in the forward of Linear, and sum the gradOutputs in the backward (instead of gemv). It also avoids creating, resizing, and filling the add_buffer with 1s at every iteration. I think this is related to this discussion in the Slack.

@apaszke
Contributor

apaszke commented Apr 19, 2017

@fmassa I know what the deal is with expand+add vs fill+addr; I'm just asking how it is related to this change. I don't know why I thought that expanded tensors are contiguous, nvm.

Contributor

@killeent killeent left a comment

What is the test plan for this? Do we have some benchmarking that shows this is faster?

    T init,
    ModifyOp modifyOp,
    ReduceOp reduceOp) {
  IndexType threadLane = threadIdx.x;
Contributor

threadLane seems like a bit of a misnomer here. I'm not sure how this corresponds to the lane in the warp.

Contributor Author

You're correct; it was a remnant from when I was using a 1-D block. I will name it something more appropriate.

IndexType threadLane = threadIdx.x;
IndexType groupID = threadIdx.y;
IndexType sliceIndex = blockIdx.x * blockDim.x + threadLane;
IndexType sliceStride = gridDim.x * blockDim.x;
Contributor

Similarly, sliceStride is a bit confusing - this is actually the stride with which to get the next slice for reduction, but the variable name makes it sound like the stride for elements within a slice.
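[Editor's sketch, not part of the original thread.] The reviewer's point about the name can be made concrete by simulating the indexing: sliceIndex/sliceStride form a grid-stride loop over output slices, so sliceStride is the distance between successive slices a single thread owns, not the element stride within a slice. Function and parameter names below are mine.

```python
def slices_for_thread(block_idx_x, thread_x, block_dim_x, grid_dim_x, num_slices):
    # Mirrors: sliceIndex = blockIdx.x * blockDim.x + threadIdx.x
    #          sliceStride = gridDim.x * blockDim.x
    # Each thread walks the output slices at intervals of sliceStride.
    slice_index = block_idx_x * block_dim_x + thread_x
    slice_stride = grid_dim_x * block_dim_x
    return list(range(slice_index, num_slices, slice_stride))
```

For example, with gridDim.x = 2 and blockDim.x = 4, thread (block 1, lane 2) starts at slice 6 and then jumps 8 slices at a time; collectively the 8 threads cover every slice exactly once.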

Contributor Author

Do you have a suggestion on this name?

IndexType stride = reductionStride * blockDim.y;

for (IndexType i = groupID; i < reductionSize; i += blockDim.y) {
  (*shmem) = reduceOp(*shmem, modifyOp(in.data[inOffset]));
Contributor

I'm not sure exactly how this works. I could be wrong, but aren't we hitting shared memory every time here? If we want different threads in the "group" to reduce things in registers wouldn't we need a local variable?

Contributor Author

Will check, but the compiler tends to optimize it into registers (this is why there's a volatile flag for shared memory).
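[Editor's sketch, not part of the original thread.] The alternative the reviewer is describing, written out explicitly: each thread in a y-group accumulates its strided share of the reduction in a local ("register") variable and touches shared memory only once at the end, rather than read-modify-writing *shmem on every iteration. This Python simulation of one thread's loop uses illustrative names.

```python
def group_partial_with_register(values, group_id, group_count,
                                init, modify, reduce_op):
    # Each thread of the y-group takes every group_count-th element,
    # accumulating locally instead of through shared memory.
    acc = init
    for i in range(group_id, len(values), group_count):
        acc = reduce_op(acc, modify(values[i]))
    return acc  # the kernel would then do a single store: *shmem = acc
```

Whether this matters in practice is exactly the open question in the thread: if the compiler already promotes the non-volatile shared-memory accumulator to a register, the two forms compile to the same thing.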

@csarofeen
Contributor Author

https://gist.github.com/csarofeen/80e8e567d49e3a2511d6bcd7bd891a98
can be used for benchmarking the linear improvements.
@apaszke The tensor is contiguous, but the reduction is along a non-contiguous dimension.

@csarofeen csarofeen closed this Apr 25, 2017
@csarofeen
Contributor Author

Still working on this.
