Avx512 by fenrus75 · Pull Request #10416 · opencv/opencv

fenrus75 · 2017-12-25T21:11:05Z

This pull request adds support for AVX512 instructions for some of the DNN operations.

allow_multiple_commits=1

The OpenCV infrastructure mostly has the basics for supporting AVX512 math functions, but they weren't hooked up (likely due to a lack of users). In order to compile the DNN functions for AVX512, a few things need to be hooked up, and this patch does that.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>

alalek · 2017-12-26T05:36:45Z

Thank you for the contribution!

May I ask you to share performance numbers for AVX2 vs AVX512?

Something like this:

setup env: OPENCV_TEST_DATA_PATH=<opencv_extra>/testdata
setup env: OPENCV_DNN_TEST_DATA_PATH=<some_dir_with_dnn_subfolder>
use this script to download test DNN models (into "dnn" subfolder, size ~2Gb): https://github.com/opencv/opencv_extra/blob/master/testdata/dnn/download_models.py

compile baseline code (without patch) and run:

./bin/opencv_perf_dnn --gtest_output=xml:base.xml

compile patched code and run:

./bin/opencv_perf_dnn --gtest_output=xml:optimized.xml

generate report:

python <opencv_src>/modules/ts/misc/summary.py base.xml optimized.xml

post results here (use "-o markdown" to generate report compatible with GitHub comments)
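The steps above can be collected into a small helper that only assembles the environment and command lines, without running anything. This is a sketch: the function names `benchmark_env` and `benchmark_commands` and the example paths are hypothetical, while the binary name, the `--gtest_output` values, the `summary.py` location and the `-o markdown` option are taken from the steps above.

```python
import shlex


def benchmark_env(opencv_extra, dnn_parent_dir):
    """Environment variables from the setup steps (paths are placeholders)."""
    return {
        "OPENCV_TEST_DATA_PATH": f"{opencv_extra}/testdata",
        "OPENCV_DNN_TEST_DATA_PATH": dnn_parent_dir,
    }


def benchmark_commands(opencv_src, build_dir="."):
    """Command lines for the baseline run, the patched run and the report."""
    perf = f"{build_dir}/bin/opencv_perf_dnn"
    summary = f"{opencv_src}/modules/ts/misc/summary.py"
    return {
        # Run once in a build without the patch ...
        "baseline": [perf, "--gtest_output=xml:base.xml"],
        # ... and once in a build with the patch applied.
        "patched": [perf, "--gtest_output=xml:optimized.xml"],
        # Compare the two XML logs and emit a GitHub-friendly table.
        "report": ["python", summary, "base.xml", "optimized.xml", "-o", "markdown"],
    }


if __name__ == "__main__":
    for name, value in benchmark_env("/path/to/opencv_extra", "/path/to/models").items():
        print(f"export {name}={shlex.quote(value)}")
    for step, argv in benchmark_commands("/path/to/opencv").items():
        print(f"{step}: {shlex.join(argv)}")
```

The printed lines can be pasted into a shell; the baseline and patched runs need to be executed from their respective build directories.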

This patch adds AVX512-optimized fastConv as well as the hookups needed to get it called in the convolution_layer. AVX512 fastConv is code-identical on a C level to the AVX2 one, but is measurably faster due to AVX512 having more registers available to cache results in.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>

fenrus75 · 2017-12-26T16:03:32Z

(Updated with a somewhat cleaned-up and simplified patch, which produces this data.)

Geometric mean

Name of Test	base	optimized	optimized vs base (x-factor)
AlexNet::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU)	7.409 ms	7.156 ms	1.04
ENet::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU)	22.722 ms	20.854 ms	1.09
GoogLeNet::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU)	11.680 ms	10.386 ms	1.12
Inception_5h::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU)	12.478 ms	11.133 ms	1.12
MobileNet_SSD_Caffe::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU)	14.377 ms	13.106 ms	1.10
OpenFace::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU)	4.172 ms	4.060 ms	1.03
ResNet50::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU)	33.727 ms	30.341 ms	1.11
SSD::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU)	162.817 ms	143.521 ms	1.13
SqueezeNet_v1_1::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU)	2.744 ms	2.349 ms	1.17
perf::ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF)	0.243 ms	0.216 ms	1.13
perf::ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON)	0.077 ms	0.058 ms	1.33
perf::ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF)	0.235 ms	0.205 ms	1.15
perf::ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON)	0.074 ms	0.056 ms	1.33
perf::ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF)	0.261 ms	0.216 ms	1.21
perf::ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON)	0.083 ms	0.061 ms	1.37
perf::ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF)	0.364 ms	0.296 ms	1.23
perf::ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON)	0.111 ms	0.081 ms	1.38
perf::ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF)	0.154 ms	0.126 ms	1.22
perf::ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON)	0.067 ms	0.054 ms	1.22
perf::ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF)	0.335 ms	0.314 ms	1.07
perf::ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON)	0.187 ms	0.160 ms	1.17
perf::ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF)	0.410 ms	0.362 ms	1.13
perf::ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON)	0.129 ms	0.101 ms	1.28
perf::ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF)	0.499 ms	0.433 ms	1.15
perf::ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON)	0.154 ms	0.120 ms	1.28
perf::ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF)	1.507 ms	1.273 ms	1.18
perf::ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON)	0.400 ms	0.326 ms	1.23
perf::ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF)	2.563 ms	2.387 ms	1.07
perf::ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON)	0.667 ms	0.629 ms	1.06
perf::ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF)	1.105 ms	0.977 ms	1.13
perf::ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON)	0.386 ms	0.316 ms	1.22
perf::ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF)	2.389 ms	1.999 ms	1.20
perf::ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON)	1.009 ms	0.898 ms	1.12
perf::ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF)	0.737 ms	0.673 ms	1.09
perf::ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON)	0.235 ms	0.194 ms	1.21
perf::ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF)	1.060 ms	0.917 ms	1.16
perf::ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON)	0.316 ms	0.252 ms	1.26
perf::ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF)	4.063 ms	3.885 ms	1.05
perf::ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON)	1.045 ms	1.041 ms	1.00
perf::ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF)	6.947 ms	6.810 ms	1.02
perf::ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON)	1.762 ms	1.792 ms	0.98
perf::ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF)	2.663 ms	2.251 ms	1.18
perf::ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON)	0.700 ms	0.612 ms	1.14
perf::ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF)	5.185 ms	4.394 ms	1.18
perf::ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON)	1.650 ms	1.514 ms	1.09
perf::ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF)	0.356 ms	0.321 ms	1.11
perf::ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON)	0.344 ms	0.317 ms	1.08
perf::ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF)	0.475 ms	0.434 ms	1.10
perf::ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON)	0.476 ms	0.415 ms	1.15
perf::ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF)	2.047 ms	2.130 ms	0.96
perf::ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON)	2.009 ms	2.161 ms	0.93
perf::ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF)	7.017 ms	6.894 ms	1.02
perf::ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON)	7.049 ms	6.949 ms	1.01
perf::ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF)	2.640 ms	2.453 ms	1.08
perf::ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON)	2.617 ms	2.411 ms	1.09
perf::ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF)	8.449 ms	7.889 ms	1.07
perf::ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON)	8.357 ms	7.763 ms	1.08
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF)	0.246 ms	0.218 ms	1.13
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON)	0.076 ms	0.057 ms	1.32
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF)	0.237 ms	0.205 ms	1.16
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON)	0.074 ms	0.055 ms	1.34
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF)	0.261 ms	0.216 ms	1.21
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON)	0.083 ms	0.061 ms	1.36
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF)	0.362 ms	0.293 ms	1.24
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON)	0.112 ms	0.080 ms	1.39
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF)	0.154 ms	0.127 ms	1.21
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON)	0.066 ms	0.054 ms	1.22
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF)	0.343 ms	0.311 ms	1.10
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON)	0.176 ms	0.161 ms	1.09
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF)	0.403 ms	0.362 ms	1.11
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON)	0.127 ms	0.101 ms	1.25
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF)	0.502 ms	0.437 ms	1.15
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON)	0.155 ms	0.120 ms	1.29
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF)	1.507 ms	1.303 ms	1.16
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON)	0.400 ms	0.326 ms	1.23
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF)	2.557 ms	2.245 ms	1.14
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON)	0.667 ms	0.581 ms	1.15
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF)	1.101 ms	0.936 ms	1.18
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON)	0.388 ms	0.317 ms	1.22
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF)	2.378 ms	2.004 ms	1.19
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON)	1.002 ms	0.880 ms	1.14
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF)	0.738 ms	0.666 ms	1.11
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON)	0.237 ms	0.189 ms	1.25
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF)	1.070 ms	0.917 ms	1.17
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON)	0.316 ms	0.250 ms	1.26
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF)	4.088 ms	3.808 ms	1.07
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON)	1.045 ms	1.007 ms	1.04
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF)	6.950 ms	6.364 ms	1.09
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON)	1.763 ms	1.698 ms	1.04
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF)	2.638 ms	2.206 ms	1.20
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON)	0.710 ms	0.608 ms	1.17
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF)	5.133 ms	4.421 ms	1.16
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON)	1.596 ms	1.450 ms	1.10
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF)	0.351 ms	0.322 ms	1.09
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON)	0.338 ms	0.312 ms	1.08
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF)	0.469 ms	0.422 ms	1.11
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON)	0.469 ms	0.423 ms	1.11
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF)	2.020 ms	2.083 ms	0.97
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON)	2.004 ms	2.079 ms	0.96
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF)	7.070 ms	6.960 ms	1.02
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON)	7.007 ms	7.012 ms	1.00
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF)	2.576 ms	2.402 ms	1.07
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON)	2.634 ms	2.397 ms	1.10
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF)	8.496 ms	7.721 ms	1.10
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON)	8.311 ms	7.868 ms	1.06
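For readers checking the last column: the x-factor is the base time divided by the optimized time, so values above 1.00 mean the patched build is faster. The sketch below recomputes it for a few network rows copied from the table. The overall geometric mean of the x-factors is an extra summary added here for illustration; it is not a number reported in this thread, and the helper names are hypothetical.

```python
import math

# (test name, base ms, optimized ms) copied from the network rows of the table
ROWS = [
    ("AlexNet", 7.409, 7.156),
    ("ENet", 22.722, 20.854),
    ("GoogLeNet", 11.680, 10.386),
    ("ResNet50", 33.727, 30.341),
    ("SqueezeNet_v1_1", 2.744, 2.349),
]


def x_factor(base_ms, optimized_ms):
    """Speedup of the optimized build relative to the baseline."""
    return base_ms / optimized_ms


def geometric_mean(values):
    """Geometric mean, the usual way to average ratios such as speedups."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    for name, base_ms, optimized_ms in ROWS:
        print(f"{name}: {x_factor(base_ms, optimized_ms):.2f}")
    overall = geometric_mean(x_factor(b, o) for _, b, o in ROWS)
    print(f"geometric mean of x-factors: {overall:.2f}")
```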

alalek · 2017-12-27T17:39:36Z

Thank you for the update!

It seems there is an old design problem with AVX512; see issue #8974.
The problem is that there are many "independent" AVX512 instruction sets (this patch uses/checks for AVX512DQ).

fenrus75 · 2017-12-27T17:46:09Z

I'm pretty sure I accidentally fixed that other issue in the first patch of this PR;
AVX512 works with this PR.

Yes, AVX512 is a family, not a single point; generally I check DQ since that is what is actually shipping as a superset of the base (e.g. Core i9 etc. go beyond the 512F base).

fenrus75 · 2017-12-27T17:47:24Z

(or in other words, 512DQ is a reasonable line to draw in the sand for "what is AVX 512")
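To make the "family, not a single point" idea concrete, here is a small sketch that checks for individual AVX512 members instead of a single "AVX512" feature. It parses text in the format of Linux `/proc/cpuinfo`, where the members appear as separate flags such as `avx512f` and `avx512dq`. This is only an illustration and not how OpenCV detects CPU features; the function names and the sample text are made up.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_avx512dq_baseline(flags):
    """True if both the 512F base and the DQ extension are reported."""
    return {"avx512f", "avx512dq"} <= flags


if __name__ == "__main__":
    # Hypothetical excerpt; on Linux, read the real text from /proc/cpuinfo.
    sample = "processor\t: 0\nflags\t\t: fpu sse2 avx2 avx512f avx512dq avx512vl\n"
    print(has_avx512dq_baseline(cpu_flags(sample)))
```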

alalek · 2017-12-27T18:12:16Z

The current CMake scripts are designed to support "atomic" instruction sets (like AVX512DQ), which have a 1:1 mapping to compiler flags / processor features.

Groups like AVX512 are not properly supported for now. Perhaps AVX512 could be replaced with AVX512-KNL (Knights Landing), AVX512-SKX (Skylake with BW, DQ, VL), and AVX512-CNL (Cannon Lake with additional IFMA, VBMI) groups, but I'm not sure that is a good idea.
BTW, the Intel compiler defines the flags "COMMON-AVX512", "MIC-AVX512", and "CORE-AVX512" (DQ is in this last one).

So we should probably start with atomic features. Groups can be added later.
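The difference between atomic features and groups can be sketched as a simple expansion table: each atomic feature corresponds to one GCC/Clang `-m` flag, and a group is just a named list of atomic features. The group names and members below follow the comment above, with AVX512F added as the common base; this is an assumption for illustration, not the actual OpenCV CMake definition, and real hardware groups may contain further members (for example CD).

```python
# Assumption: membership as listed in the comment above, plus the AVX512F base.
AVX512_GROUPS = {
    "AVX512-SKX": ["avx512f", "avx512bw", "avx512dq", "avx512vl"],
    "AVX512-CNL": ["avx512f", "avx512bw", "avx512dq", "avx512vl",
                   "avx512ifma", "avx512vbmi"],
}


def gcc_flags(group):
    """Expand a group name into one GCC/Clang -m<feature> flag per atomic feature."""
    return [f"-m{feature}" for feature in AVX512_GROUPS[group]]


if __name__ == "__main__":
    for group in AVX512_GROUPS:
        print(group, " ".join(gcc_flags(group)))
```

With only atomic features, a build system needs one flag and one runtime check per feature; groups add a second layer that has to be kept consistent with the compilers and CPUs being targeted.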

fenrus75 · 2017-12-27T18:23:52Z

(I'm quite aware of the Intel roadmap/instructions since that's my dayjob ;-) )

512DQ is a reasonable baseline in terms of what is shipping/being used by people, where the CNL changes can be add-ons similar to how FMA3 is for AVX2.

alalek · 2017-12-27T19:07:53Z

Great! I believe you can propose a better solution.

I just mean that the current patch doesn't work as expected:

$ cmake -DCPU_BASELINE=AVX512 <opencv_src_dir>
$ make
...
.../opencv/modules/core/src/system.cpp: In member function ‘void cv::HWFeatures::initialize()’:
.../build/opencv/cv_cpu_config.h:51:7: error: ‘CV_CPU_AVX512’ was not declared in this scope
     , CV_CPU_AVX512 \
       ^
.../opencv/modules/core/src/system.cpp:531:37: note: in expansion of macro ‘CV_CPU_BASELINE_FEATURES’
         int baseline_features[] = { CV_CPU_BASELINE_FEATURES };
                                     ^
.../build/opencv/cv_cpu_config.h:51:7: note: suggested alternative: ‘CV_CPU_AVX2’
     , CV_CPU_AVX512 \
       ^
...opencv/modules/core/src/system.cpp:531:37: note: in expansion of macro ‘CV_CPU_BASELINE_FEATURES’
         int baseline_features[] = { CV_CPU_BASELINE_FEATURES };
                                     ^

This reproduces on all platforms (including non-AVX512 ones).

On AVX512 systems the build problem is also straightforward to reproduce:

$ CXXFLAGS="-march=native" cmake <opencv_src_dir>
$ make
... error message as above ...

(can be emulated via SDE tool: sde -skx -env 'CXXFLAGS' ' -march=native' -- cmake ../../dev)

We need to fix these builds before merging.

alalek · 2017-12-27T19:11:52Z

BTW, OpenCV knows these AVX512 CPU capabilities (there is no "AVX512", but there is "AVX_512DQ").

alalek · 2017-12-27T19:35:29Z

In this case this line should be adapted too (to keep the compiler from generating unsupported instructions, see #6990).
But then you can't use 512DQ instructions in the code because the compiler flags are missing.

My suggestion is to rename the current "AVX512" => "AVX_512DQ" and fix the compiler flags. I believe that is enough to support the current patch.

alalek · 2017-12-28T02:25:53Z

Could you take a look at these changes: alalek@pr10416_r ?

It looks like AVX-512DQ intrinsics are not used yet (switched to AVX-512F). If AVX-512DQ is necessary, let me know.
I have no access to an AVX512-capable machine at the moment, so I tested this via the SDE tool only (for the "-knl" and "-skx" targets).

fenrus75 · 2017-12-28T14:30:29Z

Tested your patch on top and it works; updated this PR.

alalek · 2017-12-28T14:35:14Z

Thank you for checking!
I will take a look at the current build failures.

alalek

@fenrus75 Thank you for the contribution!

fenrus75 · 2017-12-29T02:29:08Z

Looking at the details .. it's not quite there...
the performance really comes with avx512vl, not just avx512f

alalek · 2017-12-29T03:24:38Z

This usually means that the compiler optimizes other code (without direct intrinsic calls) by itself, and the "-mavx512f" option is not enough.

Does building this way work well?

cmake -DCPU_BASELINE=NATIVE <opencv_src_dir>
or
CXXFLAGS="-march=native" cmake <opencv_src_dir>

fenrus75 · 2017-12-29T03:28:50Z

The problem is needing -mavx512vl. VL gives the compiler 16 extra registers to work with, even when using intrinsics. The main loop needs 19 or it spills, and avoiding those spills is what gives most of the perf gain.
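The register arithmetic in this comment can be written out as a toy model. It assumes 16 vector registers are usable without VL and 32 with it (the "16 extra" mentioned above), and treats every live value beyond the register count as one spill. Real register allocation is more involved, so this only illustrates why 19 live values fit in one case and not the other; the names are hypothetical.

```python
LIVE_VALUES_IN_MAIN_LOOP = 19   # from the comment above
REGISTERS_WITHOUT_VL = 16       # assumption: the usual 16 vector registers
REGISTERS_WITH_VL = 32          # assumption: 16 extra registers with -mavx512vl


def spilled_values(live_values, vector_registers):
    """Toy model: each live value that does not fit in a register is spilled."""
    return max(0, live_values - vector_registers)


if __name__ == "__main__":
    print("without VL:", spilled_values(LIVE_VALUES_IN_MAIN_LOOP, REGISTERS_WITHOUT_VL))
    print("with VL:", spilled_values(LIVE_VALUES_IN_MAIN_LOOP, REGISTERS_WITH_VL))
```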

alalek · 2017-12-29T04:13:18Z

Thank you for the explanation!

I will take a look at this.

BTW, What CMake options do you use? (What compiler?)

alalek · 2017-12-29T06:21:29Z

@fenrus75 Please take a look at #10463 (AVX512_SKX).

fenrus75 force-pushed the avx512 branch from fb004ec to 2938860 · December 26, 2017 16:02

cmake: AVX512 -> AVX_512F

898ca38

alalek approved these changes Dec 28, 2017

opencv-pushbot merged commit 898ca38 into opencv:master Dec 28, 2017

opencv-pushbot pushed a commit that referenced this pull request Dec 28, 2017

Merge pull request #10416 from fenrus75:avx512

a65b5df

fenrus75 deleted the avx512 branch December 28, 2017 16:23

alalek mentioned this pull request Dec 29, 2017

cmake(opt): AVX512_SKX #10463

Merged

alalek mentioned this pull request Jan 26, 2018

cmake: enable CPU dispatching for AVX512 (SKX) #10700

Merged
