dnn: LSTM optimisation #20658

Merged
alalek merged 10 commits into opencv:3.4 from smbz:lstm_optimisation
Nov 29, 2021
Conversation

@smbz
Contributor

@smbz smbz commented Sep 6, 2021

This uses the AVX-optimised fastGEMM1T for matrix multiplications where available, instead of the standard cv::gemm.

fastGEMM1T is already used by the fully-connected layer. This PR involves two minor modifications to allow use with the LSTM layer:

  • Use unaligned access. I don't believe this incurs any performance hit on modern CPUs (Nehalem and Bulldozer onwards) in the case where the address is actually aligned.
  • Allow for weight matrices where the number of columns is not a multiple of 8.

I have not enabled AVX-512 or RVV as I don't have access to such CPUs to test on. If there are no problems with this PR I'll send a separate PR which deals with ARM NEON.
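The second point (tail handling for column counts that aren't a multiple of 8) can be illustrated with a scalar sketch. This is not the actual SIMD kernel, just the shape of the logic: full 8-wide blocks first, then a partial tail that never reads past the end of the row.

```cpp
#include <cassert>
#include <cstddef>

// Scalar sketch (not the real fastGEMM1T kernel) of tail handling:
// process whole blocks of 8 columns, then a masked tail, so a row whose
// length is not a multiple of 8 never accesses memory out of bounds.
static float dotWithTail(const float* w, const float* x, size_t n)
{
    float acc = 0.0f;
    size_t full = n / 8 * 8;            // columns covered by full 8-wide blocks
    for (size_t i = 0; i < full; i += 8)
        for (size_t j = 0; j < 8; j++)  // stands in for one 8-lane FMA
            acc += w[i + j] * x[i + j];
    for (size_t i = full; i < n; i++)   // tail: only n - full lanes active
        acc += w[i] * x[i];
    return acc;
}
```

In the real kernel the tail loop is a single masked vector load/FMA rather than a scalar loop.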

Performance comparison (vs the built-in gemm, i.e. a build without BLAS):

Median (ms)

                     Name of Test                      original4  AVX    AVX2     AVX        AVX2
                                                                                   vs         vs
                                                                               original4  original4
                                                                               (x-factor) (x-factor)
lstm::Layer_LSTM::BATCH=1, IN=64, HIDDEN=192, TS=100    22.851   3.819  3.093     5.98       7.39
lstm::Layer_LSTM::BATCH=1, IN=192, HIDDEN=192, TS=100   31.183   4.601  3.252     6.78       9.59
lstm::Layer_LSTM::BATCH=1, IN=192, HIDDEN=512, TS=100   144.946  19.302 15.570    7.51       9.31
lstm::Layer_LSTM::BATCH=1, IN=1024, HIDDEN=192, TS=100  93.162   10.639 8.068     8.76      11.55
lstm::Layer_LSTM::BATCH=64, IN=64, HIDDEN=192, TS=2     27.403   5.502  4.301     4.98       6.37
lstm::Layer_LSTM::BATCH=64, IN=192, HIDDEN=192, TS=2    40.038   6.830  5.548     5.86       7.22
lstm::Layer_LSTM::BATCH=64, IN=192, HIDDEN=512, TS=2    200.630  29.869 22.061    6.72       9.09
lstm::Layer_LSTM::BATCH=64, IN=1024, HIDDEN=192, TS=2   119.687  14.251 11.480    8.40      10.43
lstm::Layer_LSTM::BATCH=128, IN=64, HIDDEN=192, TS=2    57.895   10.394 9.101     5.57       6.36
lstm::Layer_LSTM::BATCH=128, IN=192, HIDDEN=192, TS=2   82.741   13.166 10.971    6.28       7.54
lstm::Layer_LSTM::BATCH=128, IN=192, HIDDEN=512, TS=2   394.310  60.755 45.158    6.49       8.73
lstm::Layer_LSTM::BATCH=128, IN=1024, HIDDEN=192, TS=2  245.459  28.773 23.413    8.53      10.48

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to proper branch
  • There is reference to original bug report and related work (N/A)
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name. (N/A)
  • The feature is well documented and sample code can be built with the project CMake
force_builders=Linux AVX2

smbz added 4 commits September 6, 2021 14:06
This uses the AVX-optimised fastGEMM1T for matrix multiplications where available, instead of the standard cv::gemm.

fastGEMM1T is already used by the fully-connected layer.  This commit involves two minor modifications:
 - Use unaligned access.  I don't believe this incurs any performance hit on modern CPUs (Nehalem and Bulldozer onwards) in the case where the address is actually aligned.
 - Allow for weight matrices where the number of columns is not a multiple of 8.

I have not enabled AVX-512 as I don't have an AVX-512 CPU to test on.
In this case the CV_TRY_X macros are defined to 0, rather than being undefined.
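The define-to-0 pattern the commit describes might look like the following sketch (in OpenCV these macros are normally set by the build system; this is just the idea):

```cpp
// Sketch of the pattern described above: when dispatch for an ISA is not
// built, define its CV_TRY_* macro to 0 rather than leaving it undefined,
// so plain `#if CV_TRY_AVX` / `if (CV_TRY_AVX && ...)` checks stay valid.
#ifndef CV_TRY_AVX
#define CV_TRY_AVX 0
#endif
#ifndef CV_TRY_AVX2
#define CV_TRY_AVX2 0
#endif
```

This keeps conditional code compiling even on builds where the dispatched sources are absent.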
Member

@alalek alalek left a comment

Thank you for contribution!

Please take a look at the failed tests on the Linux AVX2 builder ("default" checks).

Comment on lines +560 to +566
uint32_t tailMaskArray[15];
for (int i = 0; i < 8; i++) {
    tailMaskArray[i] = 0;
}
for (int i = 8; i < 15; i++) {
    tailMaskArray[i] = 0xffffffffUL;
}
Member

Use a static preallocated/initialized buffer: static uint32_t tailMaskArray[15] = { ... };
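The suggestion amounts to a statically initialised mask table with a sliding window, so nothing is recomputed per call. A sketch (the offset convention in the actual kernel may differ):

```cpp
#include <cassert>
#include <cstdint>

// Static mask table: 8 zero words followed by 7 all-ones words. Reading
// an 8-word window starting at offset `tail` (0 <= tail <= 7) yields a
// mask with exactly `tail` enabled lanes, with no per-call init loops.
static const uint32_t tailMaskArray[15] = {
    0, 0, 0, 0, 0, 0, 0, 0,
    0xffffffffu, 0xffffffffu, 0xffffffffu, 0xffffffffu,
    0xffffffffu, 0xffffffffu, 0xffffffffu
};

// Count the enabled lanes in the 8-word window at a given offset.
static int enabledLanes(int tail)
{
    int count = 0;
    for (int j = 0; j < 8; j++)
        count += tailMaskArray[tail + j] != 0;
    return count;
}
```

In the vector code the window would be loaded once with an unaligned 256-bit load and used as the mask operand for masked loads.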

_mm256_zeroupper();
}


Member

Please avoid unrelated whitespace changes.

}
else
#endif
gemm(hInternal, Wh, 1, gates, 1, gates, GEMM_2_T); //+Wh * h_{t-1}
Member

Please add braces and indentation for gemm call:

{
    gemm(...)
}

Comment on lines +129 to +130
useAVX(checkHardwareSupport(CPU_AVX)),
useAVX2(checkHardwareSupport(CPU_AVX2))
Member

Please use conditional compilation:

#if CV_TRY_AVX
    , useAVX(checkHardwareSupport(CPU_AVX))
#endif
#if CV_TRY_AVX2
    , useAVX2(checkHardwareSupport(CPU_AVX2))
#endif

smbz added 4 commits September 7, 2021 12:12
 - Don't check hardware support for AVX(2) when dispatch is disabled for these
 - Add braces
The old tail handling in fastGEMM1T implicitly rounded vecsize up to the next multiple of 8, and the fully connected layer pads up to the next multiple of 8 to cope with this.  The new tail handling does not round vecsize upwards, but it does require that vecsize is at least 8.  To adapt to the new tail handling, the fully connected layer now rounds vecsize itself at the same time as adding the padding (which makes more sense anyway).

This also means that the fully connected layer always passes a vecsize of at least 8 to fastGEMM1T, which fixes the out-of-bounds access problems.
 - Use static array for generating tail masks (as requested)
 - Apply tail mask to the weights as well as the input vectors to prevent spurious propagation of NaNs/Infs
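The rounding the commits describe is the usual round-up-to-multiple-of-8 on the inner dimension; a minimal sketch (helper name is mine, not the OpenCV code):

```cpp
#include <cassert>

// Sketch of the padding change described above: round the inner dimension
// up to the next multiple of 8 when laying out the weights, so the kernel
// can always assume vecsize >= 8 and never reads out of bounds.
static int roundUp8(int n)
{
    return (n + 7) & ~7;  // next multiple of 8 (n >= 0)
}
```

With this, an input of e.g. 5 channels is stored in an 8-wide padded row of zeros, and the kernel only ever sees multiples of 8.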
@smbz
Contributor Author

smbz commented Sep 7, 2021

@alalek Thanks for the feedback - I think I've addressed all the comments.

The failed test was pointing to a memory access issue caused by the way the new tail handling interacted with the padding logic in the fully connected layer when the number of input channels was < 8. Now fixed.

There was also a potential issue where Infs in the weight matrix could be incorrectly converted to NaNs because of the multiplication by zero in the tail. Fixed by masking both the weights and the input vector.
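The hazard here is IEEE-754 semantics: Inf times 0 is NaN, so an Inf weight multiplied by a zeroed (masked-out) input lane poisons the accumulator. A scalar sketch of the fix, masking both operands (function name is illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Scalar sketch of the tail-lane fix described above: if a lane is
// disabled, zero BOTH operands, so an Inf weight can never meet a zeroed
// input lane and produce a NaN (IEEE-754: Inf * 0 == NaN).
static float tailProduct(float w, float x, bool laneEnabled)
{
    if (!laneEnabled) {
        w = 0.0f;  // mask the weight as well as the input
        x = 0.0f;
    }
    return w * x;
}
```

Masking only the input would leave `Inf * 0`, which is exactly the spurious NaN the comment mentions.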

It looks like the win64 build has failed before actually building the code - is there a way to re-run it without a new commit?

@asmorkalov asmorkalov requested a review from JulieBar September 12, 2021 19:03
#endif
{
gemm(hInternal, Wh, 1, gates, 1, gates, GEMM_2_T); //+Wh * h_{t-1}
}
Contributor

Is it possible to add the optimization for the peephole case too?
https://github.com/opencv/opencv/blob/master/modules/dnn/src/layers/recurrent_layers.cpp#L377

Contributor Author

Good point, I hadn't noticed that.

I think it might require a bit of a rethink to avoid having lots of copy/paste code for peepholes. The obvious thing is to add a helper which handles the dispatching, and that should be separate from the recurrent layers so it's reusable. It should probably also be a class, to avoid calling checkHardwareSupport on each call. I'll have a think and suggest a design.
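A hypothetical shape for such a helper might be the following. All names are stand-ins (haveAVX2 stands in for cv::checkHardwareSupport(CPU_AVX2)); this is a design sketch, not the OpenCV code:

```cpp
#include <cassert>
#include <cstddef>

// Stub standing in for cv::checkHardwareSupport(CPU_AVX2) in this sketch.
static bool haveAVX2() { return false; }

using GemmRowFn = void (*)(const float*, const float*, float*, size_t);

// Portable fallback kernel; an AVX/AVX2 variant would sit alongside it.
static void gemmRowScalar(const float* w, const float* x, float* out, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += w[i] * x[i];
    *out = acc;
}

// Reusable dispatcher: the CPU-feature probe happens once, at
// construction, not on every call.
class GemmDispatcher
{
public:
    GemmDispatcher()
        : fn(haveAVX2() ? gemmRowScalar /* would be the AVX2 kernel */
                        : gemmRowScalar) {}
    void operator()(const float* w, const float* x, float* out, size_t n) const
    {
        fn(w, x, out, n);
    }
private:
    GemmRowFn fn;
};
```

Both the LSTM gate computation and the peephole products could then call the same dispatcher instead of duplicating the #if/useAVX ladders.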

Contributor

The existing PR can be merged as-is since it already adds a speedup. You can send these changes (for peephole) later, in a separate PR.

@asmorkalov
Contributor

@smbz Friendly reminder.

@smbz
Contributor Author

smbz commented Sep 21, 2021

Thanks for the reminder! Just to say that this isn't dead, I should get round to it in the next few days.

@asmorkalov
Contributor

@smbz Friendly reminder.

@asmorkalov asmorkalov requested a review from JulieBar October 7, 2021 05:19
Mat cOutTs = produceCellOutput ? output[1].reshape(1, numSamplesTotal) : Mat();

#if CV_TRY_AVX2 || CV_TRY_AVX
bool canUseAvx = gates.isContinuous() && bias.isContinuous()
Contributor

Looks good, thank you. Please add the same for lines 414 and 429 below.

Member

@alalek alalek left a comment

Well done! Thank you for the contribution 👍

@alalek alalek merged commit ea7d4be into opencv:3.4 Nov 29, 2021
@alalek alalek mentioned this pull request Dec 3, 2021
@alalek alalek mentioned this pull request Dec 30, 2021
@alalek alalek mentioned this pull request Feb 22, 2022