Conversation
The OpenCV infrastructure mostly has the basics for supporting AVX512 math functions, but they weren't hooked up (likely due to a lack of users). In order to compile the DNN functions for AVX512, a few things need to be hooked up, and this patch does that. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
|
Thank you for the contribution! May I ask you to share performance numbers for AVX2 vs AVX512? Something like these:
|
This patch adds an AVX512-optimized fastConv as well as the hookups needed to get it called in the convolution layer. The AVX512 fastConv is code-identical at the C level to the AVX2 one, but is measurably faster due to AVX512 having more registers available to cache results in. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
|
(Somewhat cleaned up and simplified patch updated, which shows this data.) Geometric mean
|
|
Thank you for the update! It seems there is an old design problem with AVX512 - see this issue: #8974 |
|
I'm pretty sure I accidentally fixed that other issue in the first patch of this PR. Yes, AVX512 is a family, not a single point; generally I check DQ since that is what is actually shipping as a superset of the base (e.g. Core i9 etc. go beyond the AVX512F base). |
|
(Or in other words, AVX512DQ is a reasonable line to put in the sand for "what is AVX512".) |
|
The current CMake scripts are designed to support "atomic" instruction sets (like AVX512DQ) which have a 1:1 mapping to compiler flags / processor features. Groups like AVX512 are not properly supported for now. Perhaps AVX512 can be replaced with AVX512-KNL (Knights Landing), AVX512-SKX (Skylake with BW, DQ, VL), and AVX512-CNL (Cannon Lake with additional IFMA, VBMI) groups. But I'm not sure that is a good idea. So probably we should start from atomic features; groups can be added later. |
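For context, OpenCV's build system already expresses atomic features through the `CPU_BASELINE` and `CPU_DISPATCH` CMake options; a group token such as `AVX512_SKX` (shown below) is the kind of shorthand being discussed here and is illustrative, not a token that existed at the time of this PR:

```shell
# Baseline code is compiled for SSE4.2 (guaranteed on all target CPUs);
# AVX2 and an AVX-512 group are compiled as separately dispatched variants
# selected at run time by CPU feature detection.
cmake -DCPU_BASELINE=SSE4_2 \
      -DCPU_DISPATCH=AVX2,AVX512_SKX \
      <opencv_src_dir>
```

Each dispatch entry maps to one set of compiler flags, which is why the "atomic feature" granularity matters.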
|
(I'm quite aware of the Intel roadmap/instructions since that's my day job ;-).) AVX512DQ is a reasonable baseline in terms of what is shipping/being used by people, where the CNL changes can be add-ons, similar to how FMA3 is for AVX2. |
|
Great! I believe you can propose a better solution. I just mean that the current patch doesn't work as expected. This is a reproducer for all platforms (including non-AVX512): The build-problem reproducer for AVX512 systems is quite straightforward (it can be emulated via the SDE tool): We need to fix these builds before merging. |
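The SDE emulation mentioned above looks roughly like this (the test binary name is just an example; `sde64 -skx` emulates a Skylake-X class CPU with AVX-512 on hardware that lacks it):

```shell
# Run an OpenCV test binary under Intel SDE with AVX-512 emulated,
# so illegal-instruction crashes from stray AVX-512 codegen show up
# even on non-AVX-512 development machines.
sde64 -skx -- ./bin/opencv_test_dnn
```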
|
BTW, OpenCV knows about these AVX512 CPU capabilities (there is no "AVX512", but there is "AVX_512DQ"). |
|
In this case this line should be adapted too (to avoid compiler generation of non-supported instructions, see #6990). My suggestion is to rename the current "AVX512" => "AVX_512DQ" and fix the compiler flags. I believe that is enough to support the current patch. |
|
Could you take a look at these changes: alalek@pr10416_r ?
|
|
Tested your patch on top and it works; updated this PR. |
|
Thank you for checking! |
|
Looking at the details .. it's not quite there... |
|
This usually means that the compiler optimizes other code (without direct intrinsic calls) by itself, and the "-mavx512f" option is not enough. Does the build work well in this way? |
|
On Thu, Dec 28, 2017 at 19:25 Alexander Alekhin wrote:
> This usually means that the compiler optimizes other code (without direct
> intrinsic calls) by itself, and the "-mavx512f" option is not enough.
> Does the build work well in this way?
> cmake -DCPU_BASELINE=NATIVE <opencv_src_dir>
> or
> CXXFLAGS="-march=native" cmake <opencv_src_dir>
The problem is needing -mavx512vl. VL gives the compiler 16 extra registers to work with, even when using intrinsics. The main loop needs 19 registers or it spills, and avoiding those spills is what gives most of the perf gain.
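One way to see the spill difference is to compare the generated assembly with and without AVX512VL (here `kernel.c` is a stand-in for the fastConv translation unit; spilled vector values show up as moves to and from the stack):

```shell
# Emit assembly for both flag sets; codegen works even on a host
# without AVX-512 hardware.
gcc -O2 -mavx512f             -S kernel.c -o kernel_f.s
gcc -O2 -mavx512f -mavx512vl  -S kernel.c -o kernel_fvl.s

# Count vector moves through the stack frame in each variant.
grep -c 'vmov.*(%rsp)' kernel_f.s kernel_fvl.s
```

With VL enabled the compiler can place 128/256-bit values in registers 16-31 as well, so the spill count in the hot loop should drop.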
|
|
Thank you for the explanation! I will take a look at this. BTW, what CMake options do you use? (What compiler?) |
This pull request adds support for AVX512 instructions for some of the DNN operations.