dnn: LSTM optimisation #20658

Merged
alalek merged 10 commits into opencv:3.4 from smbz:lstm_optimisation
Nov 29, 2021
Conversation

@smbz
Contributor

@smbz smbz commented Sep 6, 2021

This uses the AVX-optimised fastGEMM1T for matrix multiplications where available, instead of the standard cv::gemm.

fastGEMM1T is already used by the fully-connected layer. This PR involves two minor modifications to allow use with the LSTM layer:

  • Use unaligned access. I don't believe this incurs any performance hit on modern CPUs (Nehalem and Bulldozer onwards) in the case where the address is actually aligned.
  • Allow for weight matrices where the number of columns is not a multiple of 8.

I have not enabled AVX-512 or RVV as I don't have access to such CPUs to test on. If there are no problems with this PR I'll send a separate PR which deals with ARM NEON.
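The second point (tail handling for column counts that aren't a multiple of 8) can be illustrated with a scalar sketch. This is not the actual SIMD kernel, just the shape of the logic: full 8-wide blocks first, then a partial tail that never reads past the end of the row.

```cpp
#include <cassert>
#include <cstddef>

// Scalar sketch (not the real fastGEMM1T kernel) of tail handling:
// process whole blocks of 8 columns, then a masked tail, so a row whose
// length is not a multiple of 8 never accesses memory out of bounds.
static float dotWithTail(const float* w, const float* x, size_t n)
{
    float acc = 0.0f;
    size_t full = n / 8 * 8;            // columns covered by full 8-wide blocks
    for (size_t i = 0; i < full; i += 8)
        for (size_t j = 0; j < 8; j++)  // stands in for one 8-lane FMA
            acc += w[i + j] * x[i + j];
    for (size_t i = full; i < n; i++)   // tail: only n - full lanes active
        acc += w[i] * x[i];
    return acc;
}
```

In the real kernel the tail loop is a single masked vector load/FMA rather than a scalar loop.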

Performance comparison (vs the built-in gemm, i.e. a build without BLAS):

Median (ms)

                     Name of Test                      original4  AVX    AVX2     AVX        AVX2
                                                                                   vs         vs
                                                                               original4  original4
                                                                               (x-factor) (x-factor)
lstm::Layer_LSTM::BATCH=1, IN=64, HIDDEN=192, TS=100    22.851   3.819  3.093     5.98       7.39
lstm::Layer_LSTM::BATCH=1, IN=192, HIDDEN=192, TS=100   31.183   4.601  3.252     6.78       9.59
lstm::Layer_LSTM::BATCH=1, IN=192, HIDDEN=512, TS=100   144.946  19.302 15.570    7.51       9.31
lstm::Layer_LSTM::BATCH=1, IN=1024, HIDDEN=192, TS=100  93.162   10.639 8.068     8.76      11.55
lstm::Layer_LSTM::BATCH=64, IN=64, HIDDEN=192, TS=2     27.403   5.502  4.301     4.98       6.37
lstm::Layer_LSTM::BATCH=64, IN=192, HIDDEN=192, TS=2    40.038   6.830  5.548     5.86       7.22
lstm::Layer_LSTM::BATCH=64, IN=192, HIDDEN=512, TS=2    200.630  29.869 22.061    6.72       9.09
lstm::Layer_LSTM::BATCH=64, IN=1024, HIDDEN=192, TS=2   119.687  14.251 11.480    8.40      10.43
lstm::Layer_LSTM::BATCH=128, IN=64, HIDDEN=192, TS=2    57.895   10.394 9.101     5.57       6.36
lstm::Layer_LSTM::BATCH=128, IN=192, HIDDEN=192, TS=2   82.741   13.166 10.971    6.28       7.54
lstm::Layer_LSTM::BATCH=128, IN=192, HIDDEN=512, TS=2   394.310  60.755 45.158    6.49       8.73
lstm::Layer_LSTM::BATCH=128, IN=1024, HIDDEN=192, TS=2  245.459  28.773 23.413    8.53      10.48

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to proper branch
  • There is reference to original bug report and related work (N/A)
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name. (N/A)
  • The feature is well documented and sample code can be built with the project CMake
force_builders=Linux AVX2

smbz added 4 commits September 6, 2021 14:06
This uses the AVX-optimised fastGEMM1T for matrix multiplications where available, instead of the standard cv::gemm.

fastGEMM1T is already used by the fully-connected layer.  This commit involves two minor modifications:
 - Use unaligned access.  I don't believe this incurs any performance hit on modern CPUs (Nehalem and Bulldozer onwards) in the case where the address is actually aligned.
 - Allow for weight matrices where the number of columns is not a multiple of 8.

I have not enabled AVX-512 as I don't have an AVX-512 CPU to test on.
In this case the CV_TRY_X macros are defined to 0, rather than being undefined.
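The define-to-0 pattern the commit describes might look like the following sketch (in OpenCV these macros are normally set by the build system; this is just the idea):

```cpp
// Sketch of the pattern described above: when dispatch for an ISA is not
// built, define its CV_TRY_* macro to 0 rather than leaving it undefined,
// so plain `#if CV_TRY_AVX` / `if (CV_TRY_AVX && ...)` checks stay valid.
#ifndef CV_TRY_AVX
#define CV_TRY_AVX 0
#endif
#ifndef CV_TRY_AVX2
#define CV_TRY_AVX2 0
#endif
```

This keeps conditional code compiling even on builds where the dispatched sources are absent.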
Member

@alalek alalek left a comment

Thank you for contribution!

Please take a look at the failed tests on the Linux AVX2 builder ("default" checks).

Comment on lines +560 to +566
uint32_t tailMaskArray[15];
for (int i = 0; i < 8; i++) {
    tailMaskArray[i] = 0;
}
for (int i = 8; i < 15; i++) {
    tailMaskArray[i] = 0xffffffffUL;
}
Member

Use a static preallocated/initialized buffer: static uint32_t tailMaskArray[15] = { ... };
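The suggestion amounts to a statically initialised mask table with a sliding window, so nothing is recomputed per call. A sketch (the offset convention in the actual kernel may differ):

```cpp
#include <cassert>
#include <cstdint>

// Static mask table: 8 zero words followed by 7 all-ones words. Reading
// an 8-word window starting at offset `tail` (0 <= tail <= 7) yields a
// mask with exactly `tail` enabled lanes, with no per-call init loops.
static const uint32_t tailMaskArray[15] = {
    0, 0, 0, 0, 0, 0, 0, 0,
    0xffffffffu, 0xffffffffu, 0xffffffffu, 0xffffffffu,
    0xffffffffu, 0xffffffffu, 0xffffffffu
};

// Count the enabled lanes in the 8-word window at a given offset.
static int enabledLanes(int tail)
{
    int count = 0;
    for (int j = 0; j < 8; j++)
        count += tailMaskArray[tail + j] != 0;
    return count;
}
```

In the vector code the window would be loaded once with an unaligned 256-bit load and used as the mask operand for masked loads.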

_mm256_zeroupper();
}


Member

Please avoid unrelated whitespace changes.

}
else
#endif
gemm(hInternal, Wh, 1, gates, 1, gates, GEMM_2_T); //+Wh * h_{t-1}
Member

Please add braces and indentation for gemm call:

{
    gemm(...)
}

Comment on lines +129 to +130
useAVX(checkHardwareSupport(CPU_AVX)),
useAVX2(checkHardwareSupport(CPU_AVX2))
Member

Please use conditional compilation:

#if CV_TRY_AVX
    , useAVX(checkHardwareSupport(CPU_AVX))
#endif
#if CV_TRY_AVX2
    , useAVX2(checkHardwareSupport(CPU_AVX2))
#endif

smbz added 4 commits September 7, 2021 12:12
 - Don't check hardware support for AVX(2) when dispatch is disabled for these
 - Add braces
The old tail handling in fastGEMM1T implicitly rounded vecsize up to the next multiple of 8, and the fully connected layer pads up to the next multiple of 8 to cope with this.  The new tail handling does not round vecsize upwards, but it does require that vecsize is at least 8.  To adapt to the new tail handling, the fully connected layer now rounds vecsize itself at the same time as adding the padding (which makes more sense anyway).

This also means that the fully connected layer always passes a vecsize of at least 8 to fastGEMM1T, which fixes the out-of-bounds access problems.
 - Use static array for generating tail masks (as requested)
 - Apply tail mask to the weights as well as the input vectors to prevent spurious propagation of NaNs/Infs
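The rounding the commits describe is the usual round-up-to-multiple-of-8 on the inner dimension; a minimal sketch (helper name is mine, not the OpenCV code):

```cpp
#include <cassert>

// Sketch of the padding change described above: round the inner dimension
// up to the next multiple of 8 when laying out the weights, so the kernel
// can always assume vecsize >= 8 and never reads out of bounds.
static int roundUp8(int n)
{
    return (n + 7) & ~7;  // next multiple of 8 (n >= 0)
}
```

With this, an input of e.g. 5 channels is stored in an 8-wide padded row of zeros, and the kernel only ever sees multiples of 8.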
@smbz
Contributor Author

smbz commented Sep 7, 2021

@alalek Thanks for the feedback - I think I've addressed all the comments.

The failed test was pointing to a memory access issue caused by the way the new tail handling interacted with the padding logic in the fully connected layer when the number of input channels was < 8. Now fixed.

There was also a potential issue where Infs in the weight matrix could be incorrectly converted to NaNs because of the multiplication by zero in the tail. Fixed by masking both the weights and the input vector.
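The hazard here is IEEE-754 semantics: Inf times 0 is NaN, so an Inf weight multiplied by a zeroed (masked-out) input lane poisons the accumulator. A scalar sketch of the fix, masking both operands (function name is illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Scalar sketch of the tail-lane fix described above: if a lane is
// disabled, zero BOTH operands, so an Inf weight can never meet a zeroed
// input lane and produce a NaN (IEEE-754: Inf * 0 == NaN).
static float tailProduct(float w, float x, bool laneEnabled)
{
    if (!laneEnabled) {
        w = 0.0f;  // mask the weight as well as the input
        x = 0.0f;
    }
    return w * x;
}
```

Masking only the input would leave `Inf * 0`, which is exactly the spurious NaN the comment mentions.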

It looks like the win64 build has failed before actually building the code - is there a way to re-run it without a new commit?

@asmorkalov asmorkalov requested a review from JulieBar September 12, 2021 19:03
#endif
{
gemm(hInternal, Wh, 1, gates, 1, gates, GEMM_2_T); //+Wh * h_{t-1}
}
Contributor

Is it possible to add the optimization for the peephole case too?
https://github.com/opencv/opencv/blob/master/modules/dnn/src/layers/recurrent_layers.cpp#L377

Contributor Author

Good point, I hadn't noticed that.

I think it might require a bit of a rethink to avoid having lots of copy/paste code for peepholes. The obvious thing is to add a helper which handles the dispatching, and that should be separate from the recurrent layers so it's reusable. It should probably also be a class, to avoid calling checkHardwareSupport on each call. I'll have a think and suggest a design.
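A hypothetical shape for such a helper might be the following. All names are stand-ins (haveAVX2 stands in for cv::checkHardwareSupport(CPU_AVX2)); this is a design sketch, not the OpenCV code:

```cpp
#include <cassert>
#include <cstddef>

// Stub standing in for cv::checkHardwareSupport(CPU_AVX2) in this sketch.
static bool haveAVX2() { return false; }

using GemmRowFn = void (*)(const float*, const float*, float*, size_t);

// Portable fallback kernel; an AVX/AVX2 variant would sit alongside it.
static void gemmRowScalar(const float* w, const float* x, float* out, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += w[i] * x[i];
    *out = acc;
}

// Reusable dispatcher: the CPU-feature probe happens once, at
// construction, not on every call.
class GemmDispatcher
{
public:
    GemmDispatcher()
        : fn(haveAVX2() ? gemmRowScalar /* would be the AVX2 kernel */
                        : gemmRowScalar) {}
    void operator()(const float* w, const float* x, float* out, size_t n) const
    {
        fn(w, x, out, n);
    }
private:
    GemmRowFn fn;
};
```

Both the LSTM gate computation and the peephole products could then call the same dispatcher instead of duplicating the #if/useAVX ladders.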

Contributor

The existing PR can be merged as-is since it already adds a speedup. You can send these changes (for peephole) later, in a separate PR.

@asmorkalov
Contributor

@smbz Friendly reminder.

@smbz
Contributor Author

smbz commented Sep 21, 2021

Thanks for the reminder! Just to say that this isn't dead, I should get round to it in the next few days.

@asmorkalov
Contributor

@smbz Friendly reminder.

@asmorkalov asmorkalov requested a review from JulieBar October 7, 2021 05:19
Mat cOutTs = produceCellOutput ? output[1].reshape(1, numSamplesTotal) : Mat();

#if CV_TRY_AVX2 || CV_TRY_AVX
bool canUseAvx = gates.isContinuous() && bias.isContinuous()
Contributor

Looks good, thank you. Please add the same for lines 414 and 429 below.

Member

@alalek alalek left a comment

Well done! Thank you for the contribution 👍

@alalek alalek merged commit ea7d4be into opencv:3.4 Nov 29, 2021
@alalek alalek mentioned this pull request Dec 3, 2021
@alalek alalek mentioned this pull request Dec 30, 2021
@alalek alalek mentioned this pull request Feb 22, 2022