Conversation

@ghost

@ghost ghost commented Jan 11, 2016

Improvement of im2col() and col2im() functions on CPU.

@ajtulloch
Contributor

What is the performance impact of this on a model like AlexNet or VGG? I see ~10% speedup for the forward pass on VGG-16 after patching this in, is that in line with your expectations?

@ghost
Author

ghost commented Jan 12, 2016

Hi, I made only some general improvements to both of the mentioned functions on master. The changes affect the average Forward-Backward time: using CPU only, on a platform with an Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz, compiled with MKL, I got a x1.189 speedup for AlexNet, x1.077 for GoogleNet, and x1.217 for CaffeNet; compiled with OpenBLAS (single-threaded): x1.095 for AlexNet and x1.056 for GoogleNet.

@ghost
Author

ghost commented Jan 14, 2016

Here are some results from this layerwise benchmark with the new PR: https://github.com/soumith/convnet-benchmarks#layer-wise-benchmarking-last-updated-april-2015

The configuration is as before: CPU_ONLY set to 1, with MKL, on an Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (36 threads, each on a separate core).

Results captured before and after the modification are attached:
Results.zip

Below are the average Forward-Backward times [ms] before and after the CPU optimizations, along with the improvement ratio:

run_imagenet.sh
alexnet.prototxt: 3215.80, 2737.10, x1.175
overfeat.prototxt: 5225.80, 4580.60, x1.141
vgg_a.prototxt: 12640.20, 10909.70, x1.159
googlenet.prototxt: 14614.90, 13508.50, x1.082

run_forcegradinput.sh
conv1.prototxt: 2781.70, 2031.80, x1.369
conv2.prototxt: 8927.90, 6924.90, x1.289
conv3.prototxt: 3325.40, 2798.00, x1.188
conv4.prototxt: 479.60, 380.60, x1.260
conv5.prototxt: 397.00, 349.50, x1.136

run_nogradinput.sh
conv1.prototxt: 1976.00, 1400.00, x1.411
conv2.prototxt: 6115.80, 4354.50, x1.404
conv3.prototxt: 2297.70, 1855.10, x1.239
conv4.prototxt: 344.60, 258.30, x1.334
conv5.prototxt: 282.50, 248.00, x1.139

Contributor


Most of these arguments can remain const, right?

@longjon
Contributor

longjon commented Jan 15, 2016

Thanks @intelmm, this looks like a nice boost. I've made a few comments as noted. Two points I'd like to clear up before merge:

  1. Let's be confident there are no serious regressions from this change. The provided benchmarks seem like pretty good evidence of this, but others may want to comment.
  2. Can you explain the change? Why is this code faster?

@ghost
Author

ghost commented Jan 18, 2016

Thank you @longjon for the review and comments! I'll commit an updated version very soon.

Regarding your questions, here are some general points:

  • We profiled with callgrind, which showed that the im2col function takes a serious share of the total execution time.
  • Therefore even small performance changes in this function noticeably affect the total execution time.
  • In our cases, the improvements changed the performance of the function itself by a factor of x1.5 to x5, depending on the conv layer type.
  • The col2im function, on the other hand, does not consume as many resources; however, its structure is almost the same as im2col's, so it can be updated just as easily.
  • Moreover, updating col2im is recommended so that both pieces of code look similar, for easier maintenance.

Why is this code faster?

  • In fact, the code has changed, but functionally it still works the same.
  • First, output elements are stored one by one, so instead of computing the index of each output element in the array, the store is simply *(data_col++) = x;
  • The condition previously used for the padding check is quite costly, so I made two changes:
    • For the vertical coordinate, the condition is moved up into the outer loop, which fills whole lines with zeros.
    • The tricky part is changing the condition from "if (0 <= a && a < N)" to "if ((unsigned) a < (unsigned) N)":
      • This folds two comparisons into one by mapping negative values to huge unsigned values like 0xFFFFF...,
      • which are guaranteed to be larger than N (N is of type int, i.e. smaller than 0x80000...).
  • All divide and modulo operations have been replaced by additional nested "for" loops (i.e. replaced by increment operations).

What could we do next?

  • The col2im function could still be improved further, but for that it would need to be redesigned from scratch.
  • The current implementation clears the output matrix by assigning zero to each output cell, and then accumulates values with multiple read-write operations in a fashion that completely misuses the CPU cache.
  • It should instead accumulate the value for each output cell in a single variable and then store it once.

@longjon
Contributor

longjon commented Jan 18, 2016

Thanks for the explanation, that makes this clearer (and might be a useful reference for future optimization).

The unsigned casting is pretty tricky (although it does seem well-defined)... could we wrap it in an inline function that describes its purpose? (And use static_cast per note.)

An improved col2im implementation would be welcome in a future PR.

@ghost
Author

ghost commented Jan 19, 2016

Dear @longjon, all the changes have been committed according to your comments.

Contributor


This should be a two-space indent per Google style guide (not sure why lint doesn't catch it).

@longjon
Contributor

longjon commented Jan 19, 2016

The latest version seems understandable and the diff is much cleaner, thanks!

I've commented on one spacing error; please squash into one commit before merge.

@ghost
Author

ghost commented Jan 20, 2016

Dear @longjon, thank you for the review. The commit is ready.

longjon added a commit that referenced this pull request Jan 20, 2016
Performance related update of im2col() and col2im() functions
@longjon longjon merged commit 813c3c9 into BVLC:master Jan 20, 2016
@longjon
Contributor

longjon commented Jan 20, 2016

Clean diff, clean history, clear code... merging! Thanks, and looking forward to future CPU optimizations!

@ghost ghost deleted the im2col-speedup branch January 21, 2016 12:04