Conversation

@ghost

@ghost ghost commented Jan 11, 2016

Improvement of im2col() and col2im() functions on CPU.

@ajtulloch
Contributor

What is the performance impact of this on a model like AlexNet or VGG? I see ~10% speedup for the forward pass on VGG-16 after patching this in, is that in line with your expectations?

@ghost
Author

ghost commented Jan 12, 2016

Hi, I made only some general improvements to both of the mentioned functions on master. The changes affect the average Forward-Backward time: using CPU only, on a platform with an Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz, compiled with MKL, I got a x1.189 speedup for AlexNet, x1.077 for GoogleNet, and x1.217 for CaffeNet; compiled with OpenBLAS (single-threaded): x1.095 for AlexNet and x1.056 for GoogleNet.

@ghost
Author

ghost commented Jan 14, 2016

Here are some results from this layerwise benchmark with the new PR: https://github.com/soumith/convnet-benchmarks#layer-wise-benchmarking-last-updated-april-2015

The configuration is as before: CPU_ONLY set to 1, with MKL, on an Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (36 threads, each on a separate core).

Results captured before and after the modification are attached:
Results.zip

Below are the average Forward-Backward times [ms] before and after the CPU optimizations, along with the improvement ratio:

run_imagenet.sh
alexnet.prototxt: 3215.80, 2737.10, x1.175
overfeat.prototxt: 5225.80, 4580.60, x1.141
vgg_a.prototxt: 12640.20, 10909.70, x1.159
googlenet.prototxt: 14614.90, 13508.50, x1.082

run_forcegradinput.sh
conv1.prototxt: 2781.70, 2031.80, x1.369
conv2.prototxt: 8927.90, 6924.90, x1.289
conv3.prototxt: 3325.40, 2798.00, x1.188
conv4.prototxt: 479.60, 380.60, x1.260
conv5.prototxt: 397.00, 349.50, x1.136

run_nogradinput.sh
conv1.prototxt: 1976.00, 1400.00, x1.411
conv2.prototxt: 6115.80, 4354.50, x1.404
conv3.prototxt: 2297.70, 1855.10, x1.239
conv4.prototxt: 344.60, 258.30, x1.334
conv5.prototxt: 282.50, 248.00, x1.139

Contributor


Most of these arguments can remain const, right?

@longjon
Contributor

longjon commented Jan 15, 2016

Thanks @intelmm, this looks like a nice boost. I've made a few comments as noted. Two points I'd like to clear up before merge:

  1. Let's be confident there are no serious regressions from this change. The provided benchmarks seem like pretty good evidence of this, but others may want to comment.
  2. Can you explain the change? Why is this code faster?

@ghost
Author

ghost commented Jan 18, 2016

Thank you @longjon for the review and comments! I'll commit an updated version very soon.

Regarding your questions, here are some general points:

  • We profiled with callgrind, which showed that the im2col function takes a serious share of the total execution time.
  • Therefore even small performance changes in this function noticeably affect the total execution time.
  • In our cases, the improvements changed the performance of the function itself by a factor of x1.5 to x5, depending on the conv layer type.
  • The col2im function, on the other hand, does not consume as many resources; however, its structure is almost the same as im2col's, so it can be updated just as easily.
  • Moreover, updating col2im is recommended so that both pieces of code look similar, for easier maintenance.

Why is this code faster?

  • In fact, the code has changed, but functionally it still works the same.
  • First, output elements are stored one by one, so instead of computing the index of each output element in the array, the store is simply *(data_col++) = x;
  • The condition previously used for the padding check is quite costly, so I made two changes:
    • For the vertical coordinate, the condition is moved up into the outer loop, which fills whole lines with zeros.
    • The tricky part is changing the condition from "if (0 <= a && a < N)" to "if ((unsigned) a < (unsigned) N)":
      • This folds two comparisons into one by mapping negative values to huge unsigned values like 0xFFFFF...,
      • which are guaranteed to be larger than N (N is of type int, i.e. smaller than 0x80000...).
  • All divide and modulo operations have been replaced by additional nested "for" loops (i.e. replaced by increment operations).

What could we do next?

  • The col2im function could still be improved further, but for that it would need to be redesigned from scratch.
  • The current implementation clears the output matrix by assigning zero to each output cell, and then accumulates values with multiple read-write operations in a fashion that completely misuses the CPU cache.
  • It should instead accumulate the value for each output cell in a single variable and then store it once.

@longjon
Contributor

longjon commented Jan 18, 2016

Thanks for the explanation, that makes this clearer (and might be a useful reference for future optimization).

The unsigned casting is pretty tricky (although it does seem well-defined)... could we wrap it in an inline function that describes its purpose? (And use static_cast per note.)

An improved col2im implementation would be welcome in a future PR.

@ghost
Author

ghost commented Jan 19, 2016

Dear @longjon, all the changes have been committed according to your comments.

Contributor


This should be a two-space indent per Google style guide (not sure why lint doesn't catch it).

@longjon
Contributor

longjon commented Jan 19, 2016

The latest version seems understandable and the diff is much cleaner, thanks!

I've commented on one spacing error; please squash into one commit before merge.

@ghost
Author

ghost commented Jan 20, 2016

Dear @longjon, thank you for the review. The commit is ready.

longjon added a commit that referenced this pull request Jan 20, 2016
Performance related update of im2col() and col2im() functions
@longjon longjon merged commit 813c3c9 into BVLC:master Jan 20, 2016
@longjon
Contributor

longjon commented Jan 20, 2016

Clean diff, clean history, clear code... merging! Thanks, and looking forward to future CPU optimizations!

@ghost ghost deleted the im2col-speedup branch January 21, 2016 12:04