Conversation
This uses the AVX-optimised fastGEMM1T for matrix multiplications where available, instead of the standard cv::gemm. fastGEMM1T is already used by the fully-connected layer. This commit involves two minor modifications:
- Use unaligned access. I don't believe this incurs any performance hit on modern CPUs (Nehalem and Bulldozer onwards) in the case where the address is actually aligned.
- Allow for weight matrices where the number of columns is not a multiple of 8.
I have not enabled AVX-512 as I don't have an AVX-512 CPU to test on.
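For context, a scalar sketch of the operation fastGEMM1T accelerates, as it is used by the fully-connected layer: each output is the dot product of one weight row with the input vector, plus a bias. The parameter layout below is an assumption for illustration; the real AVX kernel processes several rows at once with 256-bit loads.

```cpp
// Scalar reference for a fastGEMM1T-style matrix-vector product:
// dst[i] = bias[i] + dot(weights row i, vec).  Names are illustrative.
static void gemm1T_ref(const float* vec, const float* weights, int wstep,
                       const float* bias, float* dst, int nvecs, int vecsize)
{
    for (int i = 0; i < nvecs; i++) {
        const float* w = weights + i * wstep;  // row i of the weight matrix
        float s = bias ? bias[i] : 0.f;
        for (int k = 0; k < vecsize; k++)
            s += w[k] * vec[k];
        dst[i] = s;
    }
}
```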
In this case the CV_TRY_X macros are defined to 0, rather than being undefined.
alalek
left a comment
Thank you for the contribution!
Please take a look on failed tests on Linux AVX2 builder ("default" checks).
uint32_t tailMaskArray[15];
for (int i = 0; i < 8; i++) {
    tailMaskArray[i] = 0;
}
for (int i = 8; i < 15; i++) {
    tailMaskArray[i] = 0xffffffffUL;
}
Use a static preallocated/initialized buffer: static uint32_t tailMaskArray[15] = { ... };
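The static array works because sliding an 8-word window over 8 zero words followed by 7 all-ones words yields a mask with exactly r active lanes for any tail of r = 1..7 leftover elements. A scalar sketch of that property (the AVX code would load the window with an unaligned 256-bit load; the offset convention below is an assumption, not necessarily the PR's exact indexing):

```cpp
#include <cstdint>

// 8 zeros followed by 7 all-ones words; an 8-word window starting at
// offset r has its last r words set.
static const uint32_t tailMaskArray[15] = {
    0, 0, 0, 0, 0, 0, 0, 0,
    0xffffffffu, 0xffffffffu, 0xffffffffu, 0xffffffffu,
    0xffffffffu, 0xffffffffu, 0xffffffffu
};

// Count the active lanes in the window for a tail of r elements.
static int activeLanes(int r) {
    const uint32_t* mask = tailMaskArray + r;
    int count = 0;
    for (int k = 0; k < 8; k++)
        count += (mask[k] != 0);
    return count;
}
```

The active lanes sit at the end of the window, which matches an overlapping load of the final 8 elements where only the last r are still unprocessed.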
_mm256_zeroupper();
}
please avoid unrelated whitespace changes
}
else
#endif
gemm(hInternal, Wh, 1, gates, 1, gates, GEMM_2_T); //+Wh * h_{t-1}
Please add braces and indentation for the gemm call:
{
    gemm(...)
}
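For reference, the gemm call in question computes gates += hInternal * Wh^T (the GEMM_2_T flag transposes the second operand, and both alpha and beta are 1). A minimal scalar equivalent, with cv::Mat replaced by raw arrays purely for illustration:

```cpp
// Scalar sketch of gemm(A, B, 1, C, 1, C, GEMM_2_T): C += A * B^T.
// A is m x k, B is n x k (transposed by the flag), C is m x n.
static void gemmABt_accum(const float* A, const float* B, float* C,
                          int m, int n, int k)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            float s = 0.f;
            for (int t = 0; t < k; t++)
                s += A[i * k + t] * B[j * k + t];  // row of A . row of B
            C[i * n + j] += s;
        }
}
```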
useAVX(checkHardwareSupport(CPU_AVX)),
useAVX2(checkHardwareSupport(CPU_AVX2))
Please use conditional compilation:
#if CV_TRY_AVX
, useAVX(checkHardwareSupport(CPU_AVX))
#endif
#if CV_TRY_AVX2
, useAVX2(checkHardwareSupport(CPU_AVX2))
#endif
- Don't check hardware support for AVX(2) when dispatch is disabled for these
- Add braces
The old tail handling in fastGEMM1T implicitly rounded vecsize up to the next multiple of 8, and the fully-connected layer implements padding up to the next multiple of 8 to cope with this. The new tail handling does not round vecsize upwards like this, but it does require that vecsize is at least 8. To adapt to the new tail handling, the fully-connected layer now rounds vecsize itself at the same time as adding the padding (which makes more sense anyway). This also means that the fully-connected layer always passes a vecsize of at least 8 to fastGEMM1T, which fixes the out-of-bounds access problems.
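The rounding now done in the fully-connected layer can be sketched as follows (function name illustrative, not the PR's actual code):

```cpp
// Round vecsize up to the next multiple of 8.  Since a layer always has
// at least one input element, the result is also always >= 8, which
// satisfies the new tail handling's minimum-width requirement.
static int roundUpTo8(int vecsize) {
    return (vecsize + 7) & ~7;
}
```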
- Use static array for generating tail masks (as requested)
- Apply tail mask to the weights as well as the input vectors to prevent spurious propagation of NaNs/Infs
@alalek Thanks for the feedback - I think I've addressed all the comments. The failed test was pointing to a memory access issue caused by how the new tail handling interacted with the padding logic from the fully connected layer if the number of input channels was < 8. Now fixed. There was also a potential issue where Infs in the weight matrix could be incorrectly converted to NaNs because of the multiplication by zero in the tail. Fixed by masking both the weights and the input vector. It looks like the win64 build has failed before actually building the code - is there a way to re-run it without a new commit?
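The Inf issue follows from IEEE-754 arithmetic: infinity multiplied by zero is NaN, so zeroing only the input lanes still lets an Inf weight in a masked-off lane poison the accumulator. Masking the weight lane too turns the product into a harmless 0 * 0. A tiny demonstration (helper name is illustrative):

```cpp
#include <cmath>

// With only the input masked to zero, an Inf weight in a dead lane
// yields Inf * 0 = NaN, which would propagate through the sum.
inline bool lanePoisons(float weight, float maskedInput) {
    return std::isnan(weight * maskedInput);
}
```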
#endif
{
    gemm(hInternal, Wh, 1, gates, 1, gates, GEMM_2_T); //+Wh * h_{t-1}
}
Is it possible to add the optimization in the peephole case too?
https://github.com/opencv/opencv/blob/master/modules/dnn/src/layers/recurrent_layers.cpp#L377
Good point, I hadn't noticed that.
I think it might require a bit of a rethink to avoid having lots of copy/paste code for peepholes. The obvious thing is to add a helper which handles the dispatching, and that should be separate to the recurrent layers so it's reusable. It should probably also be a class to avoid checkHardwareSupport on each call. I'll have a think and suggest a design.
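One hypothetical shape for that helper, with the feature checks cached at construction so they don't run on every call. Everything below is invented for illustration (class, enum, and parameter names are not OpenCV API); in the real thing the constructor would call cv::checkHardwareSupport and the chosen path would invoke the corresponding fastGEMM1T variant.

```cpp
// Illustrative dispatch paths standing in for the real kernels.
enum Path { PATH_AVX2, PATH_AVX, PATH_GENERIC };

class GemmDispatcher {
    bool useAVX2, useAVX;
public:
    // Query CPU features once at construction instead of per call
    // (the real version would use cv::checkHardwareSupport here).
    GemmDispatcher(bool haveAVX2, bool haveAVX)
        : useAVX2(haveAVX2), useAVX(haveAVX) {}

    // Pick the best available implementation, preferring AVX2.
    Path choose() const {
        if (useAVX2) return PATH_AVX2;
        if (useAVX)  return PATH_AVX;
        return PATH_GENERIC;
    }
};
```

Keeping the dispatcher separate from the recurrent layers would let the fully-connected layer and a future peephole path reuse it.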
the existing PR can be merged as-is since it already adds speedup. you can send these changes (for peephole) later, in a separate PR
@smbz Friendly reminder.

Thanks for the reminder! Just to say that this isn't dead, I should get round to it in the next few days.

@smbz Friendly reminder.
Mat cOutTs = produceCellOutput ? output[1].reshape(1, numSamplesTotal) : Mat();

#if CV_TRY_AVX2 || CV_TRY_AVX
bool canUseAvx = gates.isContinuous() && bias.isContinuous()
Looks good, thank you. Please add the same for lines 414 and 429 below.
alalek
left a comment
Well done! Thank you for the contribution 👍
This uses the AVX-optimised fastGEMM1T for matrix multiplications where available, instead of the standard cv::gemm.
fastGEMM1T is already used by the fully-connected layer. This PR involves two minor modifications to allow use with the LSTM layer:
- Use unaligned access (no performance hit on modern CPUs when the address is actually aligned).
- Allow for weight matrices where the number of columns is not a multiple of 8.
I have not enabled AVX-512 or RVV as I don't have access to such CPUs to test on. If there are no problems with this PR I'll send a separate PR which deals with ARM NEON.
Performance comparison (vs built-in gemm, i.e. build without a BLAS):
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name. (N/A)