
Avx512 #10416

Merged
opencv-pushbot merged 3 commits into opencv:master from fenrus75:avx512
Dec 28, 2017


@fenrus75
Contributor

@fenrus75 fenrus75 commented Dec 25, 2017

This pull request adds support for AVX512 instructions to some of the DNN operations.

allow_multiple_commits=1

The OpenCV infrastructure already has most of the basics for supporting AVX512 math functions,
but they weren't hooked up (likely due to a lack of users).

To compile the DNN functions for AVX512, a few things need to be hooked up;
this patch does that.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
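One common way to hook up an AVX512 path without building the whole library with AVX512 flags is to compile a single function for the wider instruction set and select it at runtime. The sketch below assumes GCC or Clang on x86-64 and uses their `__attribute__((target("avx512f")))` and `__builtin_cpu_supports` extensions; it is not OpenCV's own CPU dispatch mechanism, and `dot`, `dot_scalar`, `dot_avx512f` and `select_dot` are hypothetical names.

```cpp
// Hypothetical sketch: runtime dispatch between an AVX-512F kernel and a
// scalar fallback using GCC/Clang extensions (not OpenCV's dispatcher).
#include <cassert>
#include <cstddef>

#if defined(__GNUC__) && defined(__x86_64__)
#include <immintrin.h>
#define SKETCH_HAVE_X86_64 1
#endif

// Portable reference path: plain scalar dot product.
static float dot_scalar(const float* a, const float* b, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += a[i] * b[i];
    return acc;
}

#ifdef SKETCH_HAVE_X86_64
// Only this function is compiled with AVX-512F enabled; the rest of the
// translation unit keeps the default baseline instruction set.
__attribute__((target("avx512f")))
static float dot_avx512f(const float* a, const float* b, std::size_t n) {
    __m512 acc = _mm512_setzero_ps();
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16)  // 16 floats per 512-bit register
        acc = _mm512_fmadd_ps(_mm512_loadu_ps(a + i), _mm512_loadu_ps(b + i), acc);
    float sum = _mm512_reduce_add_ps(acc);  // horizontal sum of the 16 lanes
    for (; i < n; ++i) sum += a[i] * b[i];  // scalar tail
    return sum;
}
#endif

using DotFn = float (*)(const float*, const float*, std::size_t);

// The AVX-512 path is chosen only when the running CPU reports AVX-512F.
static DotFn select_dot() {
#ifdef SKETCH_HAVE_X86_64
    if (__builtin_cpu_supports("avx512f")) return dot_avx512f;
#endif
    return dot_scalar;
}

float dot(const float* a, const float* b, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr));
    static const DotFn impl = select_dot();  // resolved once, then cached
    return impl(a, b, n);
}
```

Because the choice is made once and cached in a function pointer, the per-call cost is one indirect call; non-x86 builds compile only the scalar path.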
@alalek
Member

alalek commented Dec 26, 2017

Thank you for the contribution!

May I ask you to share performance numbers for AVX2 vs AVX512?

Something like this:

  • set up env: OPENCV_TEST_DATA_PATH=<opencv_extra>/testdata
  • set up env: OPENCV_DNN_TEST_DATA_PATH=<some_dir_with_dnn_subfolder>
    Use this script to download the test DNN models (into the "dnn" subfolder, size ~2 GB): https://github.com/opencv/opencv_extra/blob/master/testdata/dnn/download_models.py
  • compile the baseline code (without the patch) and run:
    ./bin/opencv_perf_dnn --gtest_output=xml:base.xml
  • compile the patched code and run:
    ./bin/opencv_perf_dnn --gtest_output=xml:optimized.xml
  • generate the report:
    python <opencv_src>/modules/ts/misc/summary.py base.xml optimized.xml
  • post the results here (use "-o markdown" to generate a report compatible with GitHub comments)

This patch adds an AVX512-optimized fastConv as well as the hookups
needed to call it from convolution_layer.

At the C level, the AVX512 fastConv is identical to the AVX2 one,
but it is measurably faster because AVX512 has more registers available
to cache results in (32 vector registers instead of 16).

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
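The claim that the same C-level kernel runs faster when compiled for AVX512 can be illustrated with a hedged sketch: one plain C++ loop body (a blocked 1-D correlation, loosely in the spirit of a convolution inner kernel, not OpenCV's actual fastConv) is instantiated under GCC/Clang target attributes, once for AVX2+FMA and once for AVX-512F. Whether the compiler vectorizes the inner loop and keeps the accumulators in registers depends on the compiler and optimization level; all function names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Shared kernel body: out[j] = sum_i w[i] * x[i + j] for 0 <= j < m.
// Outputs are processed in blocks of kBlock partial sums that the compiler
// may keep in vector registers. 64 floats occupy 8 ymm registers (AVX2) but
// only 4 zmm registers (AVX-512), and AVX-512 has 32 such registers vs 16.
template <std::size_t kBlock>
static inline __attribute__((always_inline))
void corr1d_body(const float* x, const float* w, std::size_t k,
                 float* out, std::size_t m) {
    assert(m == 0 || (x != nullptr && w != nullptr && out != nullptr));
    std::size_t j0 = 0;
    for (; j0 + kBlock <= m; j0 += kBlock) {
        float acc[kBlock] = {};
        for (std::size_t i = 0; i < k; ++i)
            for (std::size_t j = 0; j < kBlock; ++j)  // element-wise, no reduction
                acc[j] += w[i] * x[i + j0 + j];
        for (std::size_t j = 0; j < kBlock; ++j) out[j0 + j] = acc[j];
    }
    for (std::size_t j = j0; j < m; ++j) {  // leftover outputs
        float s = 0.0f;
        for (std::size_t i = 0; i < k; ++i) s += w[i] * x[i + j];
        out[j] = s;
    }
}

constexpr std::size_t kOutBlock = 64;

// Baseline build of the identical source.
void corr1d_generic(const float* x, const float* w, std::size_t k,
                    float* out, std::size_t m) {
    corr1d_body<kOutBlock>(x, w, k, out, m);
}

#if defined(__GNUC__) && defined(__x86_64__)
__attribute__((target("avx2,fma")))
void corr1d_avx2(const float* x, const float* w, std::size_t k,
                 float* out, std::size_t m) {
    corr1d_body<kOutBlock>(x, w, k, out, m);
}

__attribute__((target("avx512f")))
void corr1d_avx512(const float* x, const float* w, std::size_t k,
                   float* out, std::size_t m) {
    corr1d_body<kOutBlock>(x, w, k, out, m);
}
#endif

using CorrFn = void (*)(const float*, const float*, std::size_t, float*, std::size_t);

// Picks the widest variant the running CPU supports.
CorrFn select_corr1d() {
#if defined(__GNUC__) && defined(__x86_64__)
    if (__builtin_cpu_supports("avx512f")) return corr1d_avx512;
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
        return corr1d_avx2;
#endif
    return corr1d_generic;
}
```

The three entry points differ only in the target the compiler is told to generate code for; the C++ source of the kernel is shared.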
@fenrus75
Contributor Author

(Updated the PR with a somewhat cleaned-up and simplified patch; these are its results.)

Geometric mean

Columns: test name, base time, optimized time, optimized vs base (x-factor)
AlexNet::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU) 7.409 ms 7.156 ms 1.04
ENet::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU) 22.722 ms 20.854 ms 1.09
GoogLeNet::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU) 11.680 ms 10.386 ms 1.12
Inception_5h::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU) 12.478 ms 11.133 ms 1.12
MobileNet_SSD_Caffe::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU) 14.377 ms 13.106 ms 1.10
OpenFace::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU) 4.172 ms 4.060 ms 1.03
ResNet50::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU) 33.727 ms 30.341 ms 1.11
SSD::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU) 162.817 ms 143.521 ms 1.13
SqueezeNet_v1_1::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU) 2.744 ms 2.349 ms 1.17
perf::ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF) 0.243 ms 0.216 ms 1.13
perf::ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON) 0.077 ms 0.058 ms 1.33
perf::ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF) 0.235 ms 0.205 ms 1.15
perf::ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON) 0.074 ms 0.056 ms 1.33
perf::ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF) 0.261 ms 0.216 ms 1.21
perf::ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON) 0.083 ms 0.061 ms 1.37
perf::ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF) 0.364 ms 0.296 ms 1.23
perf::ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON) 0.111 ms 0.081 ms 1.38
perf::ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF) 0.154 ms 0.126 ms 1.22
perf::ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON) 0.067 ms 0.054 ms 1.22
perf::ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF) 0.335 ms 0.314 ms 1.07
perf::ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON) 0.187 ms 0.160 ms 1.17
perf::ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF) 0.410 ms 0.362 ms 1.13
perf::ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON) 0.129 ms 0.101 ms 1.28
perf::ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF) 0.499 ms 0.433 ms 1.15
perf::ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON) 0.154 ms 0.120 ms 1.28
perf::ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF) 1.507 ms 1.273 ms 1.18
perf::ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON) 0.400 ms 0.326 ms 1.23
perf::ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF) 2.563 ms 2.387 ms 1.07
perf::ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON) 0.667 ms 0.629 ms 1.06
perf::ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF) 1.105 ms 0.977 ms 1.13
perf::ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON) 0.386 ms 0.316 ms 1.22
perf::ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF) 2.389 ms 1.999 ms 1.20
perf::ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON) 1.009 ms 0.898 ms 1.12
perf::ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF) 0.737 ms 0.673 ms 1.09
perf::ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON) 0.235 ms 0.194 ms 1.21
perf::ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF) 1.060 ms 0.917 ms 1.16
perf::ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON) 0.316 ms 0.252 ms 1.26
perf::ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF) 4.063 ms 3.885 ms 1.05
perf::ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON) 1.045 ms 1.041 ms 1.00
perf::ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF) 6.947 ms 6.810 ms 1.02
perf::ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON) 1.762 ms 1.792 ms 0.98
perf::ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF) 2.663 ms 2.251 ms 1.18
perf::ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON) 0.700 ms 0.612 ms 1.14
perf::ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF) 5.185 ms 4.394 ms 1.18
perf::ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON) 1.650 ms 1.514 ms 1.09
perf::ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF) 0.356 ms 0.321 ms 1.11
perf::ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON) 0.344 ms 0.317 ms 1.08
perf::ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF) 0.475 ms 0.434 ms 1.10
perf::ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON) 0.476 ms 0.415 ms 1.15
perf::ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF) 2.047 ms 2.130 ms 0.96
perf::ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON) 2.009 ms 2.161 ms 0.93
perf::ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF) 7.017 ms 6.894 ms 1.02
perf::ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON) 7.049 ms 6.949 ms 1.01
perf::ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF) 2.640 ms 2.453 ms 1.08
perf::ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON) 2.617 ms 2.411 ms 1.09
perf::ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF) 8.449 ms 7.889 ms 1.07
perf::ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON) 8.357 ms 7.763 ms 1.08
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF) 0.246 ms 0.218 ms 1.13
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON) 0.076 ms 0.057 ms 1.32
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF) 0.237 ms 0.205 ms 1.16
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON) 0.074 ms 0.055 ms 1.34
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF) 0.261 ms 0.216 ms 1.21
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON) 0.083 ms 0.061 ms 1.36
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF) 0.362 ms 0.293 ms 1.24
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON) 0.112 ms 0.080 ms 1.39
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF) 0.154 ms 0.127 ms 1.21
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON) 0.066 ms 0.054 ms 1.22
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF) 0.343 ms 0.311 ms 1.10
perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON) 0.176 ms 0.161 ms 1.09
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF) 0.403 ms 0.362 ms 1.11
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON) 0.127 ms 0.101 ms 1.25
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF) 0.502 ms 0.437 ms 1.15
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON) 0.155 ms 0.120 ms 1.29
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF) 1.507 ms 1.303 ms 1.16
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON) 0.400 ms 0.326 ms 1.23
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF) 2.557 ms 2.245 ms 1.14
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON) 0.667 ms 0.581 ms 1.15
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF) 1.101 ms 0.936 ms 1.18
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON) 0.388 ms 0.317 ms 1.22
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF) 2.378 ms 2.004 ms 1.19
perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON) 1.002 ms 0.880 ms 1.14
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF) 0.738 ms 0.666 ms 1.11
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON) 0.237 ms 0.189 ms 1.25
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF) 1.070 ms 0.917 ms 1.17
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON) 0.316 ms 0.250 ms 1.26
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF) 4.088 ms 3.808 ms 1.07
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON) 1.045 ms 1.007 ms 1.04
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF) 6.950 ms 6.364 ms 1.09
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON) 1.763 ms 1.698 ms 1.04
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF) 2.638 ms 2.206 ms 1.20
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON) 0.710 ms 0.608 ms 1.17
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF) 5.133 ms 4.421 ms 1.16
perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON) 1.596 ms 1.450 ms 1.10
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF) 0.351 ms 0.322 ms 1.09
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON) 0.338 ms 0.312 ms 1.08
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF) 0.469 ms 0.422 ms 1.11
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON) 0.469 ms 0.423 ms 1.11
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF) 2.020 ms 2.083 ms 0.97
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON) 2.004 ms 2.079 ms 0.96
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF) 7.070 ms 6.960 ms 1.02
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON) 7.007 ms 7.012 ms 1.00
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF) 2.576 ms 2.402 ms 1.07
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON) 2.634 ms 2.397 ms 1.10
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF) 8.496 ms 7.721 ms 1.10
perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON) 8.311 ms 7.868 ms 1.06
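The x-factor column is consistent with base time divided by optimized time (e.g. 7.409 / 7.156 ≈ 1.035, reported as 1.04; 1.762 / 1.792 ≈ 0.983, reported as 0.98), so values below 1.00 are slowdowns. The "Geometric mean" label most likely names the per-test statistic summary.py reports over repeated samples. A minimal C++ sketch of both computations, as an illustration under those assumptions and not a port of summary.py:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean of positive samples, computed in log space to avoid
// overflow/underflow when many samples are multiplied together.
double geometric_mean(const std::vector<double>& samples) {
    assert(!samples.empty());
    double log_sum = 0.0;
    for (double s : samples) {
        assert(s > 0.0);
        log_sum += std::log(s);
    }
    return std::exp(log_sum / static_cast<double>(samples.size()));
}

// Speedup of the optimized build relative to the baseline:
// values > 1 mean the optimized build is faster.
double x_factor(double base_ms, double optimized_ms) {
    assert(optimized_ms > 0.0);
    return base_ms / optimized_ms;
}
```

The geometric mean is the usual choice for timings because one slow outlier sample skews it far less than it skews an arithmetic mean.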

@alalek
Member

alalek commented Dec 27, 2017

Thank you for the update!

It seems there is an old design problem with AVX512; see this issue: #8974.
The problem is that there are many "independent" AVX512 instruction sets (you use/check for AVX512DQ in this patch).

@fenrus75
Contributor Author

I'm pretty sure I accidentally fixed that other issue in the first patch of this PR;
AVX512 works with this PR.

Yes, AVX512 is a family, not a single point; I generally check for DQ since that is what is actually shipping as a superset of the base (e.g. Core i9 and similar parts go beyond the 512F base).

@fenrus75
Contributor Author

(In other words, 512DQ is a reasonable line in the sand for "what is AVX512".)

@alalek
Member

alalek commented Dec 27, 2017

The current CMake scripts are designed to support "atomic" instruction sets (like AVX512DQ), which have a 1:1 mapping to compiler flags / processor features.

Groups like AVX512 are not properly supported for now. Perhaps AVX512 could be replaced with the groups AVX512-KNL (Knights Landing), AVX512-SKX (Skylake with BW, DQ, VL) and AVX512-CNL (Cannon Lake with additional IFMA, VBMI), but I'm not sure that is a good idea.
BTW, the Intel compiler defines these flags: "COMMON-AVX512", "MIC-AVX512", "CORE-AVX512" (DQ is in the last one).

So we should probably start with atomic features. Groups can be added later.
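Starting from atomic features and layering groups on top can be sketched as plain bit sets: each group is just the conjunction of its atomic features, so it needs no detection logic of its own. The group contents follow the comment above, plus F and CD (which Skylake-SP and Knights Landing both have) and ER/PF for Knights Landing; the namespace, enum and function names are hypothetical and are not OpenCV's `CV_CPU_*` constants.

```cpp
#include <cassert>
#include <cstdint>

namespace avx512 {

// Hypothetical atomic feature bits: one bit per individual AVX-512 subset,
// each of which maps 1:1 to a compiler flag and a CPUID bit.
enum Feature : std::uint32_t {
    F    = 1u << 0,
    CD   = 1u << 1,
    ER   = 1u << 2,
    PF   = 1u << 3,
    BW   = 1u << 4,
    DQ   = 1u << 5,
    VL   = 1u << 6,
    IFMA = 1u << 7,
    VBMI = 1u << 8,
};

// Groups are defined only in terms of atomic features.
constexpr std::uint32_t kKNL = F | CD | ER | PF;       // Knights Landing
constexpr std::uint32_t kSKX = F | CD | BW | DQ | VL;  // Skylake-SP
constexpr std::uint32_t kCNL = kSKX | IFMA | VBMI;     // Cannon Lake

// A group (or any feature set) is usable only if every member is available.
constexpr bool has_all(std::uint32_t available, std::uint32_t required) {
    return (available & required) == required;
}

static_assert(has_all(kCNL, kSKX), "CNL is a superset of SKX");
static_assert(!has_all(kKNL, kSKX), "KNL lacks BW, DQ and VL");

}  // namespace avx512
```

With this layout, adding a new group later is a one-line constant and cannot disagree with the atomic checks.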

@fenrus75
Contributor Author

(I'm quite aware of the Intel roadmap/instructions since that's my day job ;-) )

512DQ is a reasonable baseline in terms of what is shipping/being used by people, where the CNL changes can be add-ons similar to how FMA3 is for AVX2.

@alalek
Member

alalek commented Dec 27, 2017

Great! I believe you can propose a better solution.

I just mean that the current patch doesn't work as expected:

$ cmake -DCPU_BASELINE=AVX512 <opencv_src_dir>
$ make
...
.../opencv/modules/core/src/system.cpp: In member function ‘void cv::HWFeatures::initialize()’:
.../build/opencv/cv_cpu_config.h:51:7: error: ‘CV_CPU_AVX512’ was not declared in this scope
     , CV_CPU_AVX512 \
       ^
.../opencv/modules/core/src/system.cpp:531:37: note: in expansion of macro ‘CV_CPU_BASELINE_FEATURES’
         int baseline_features[] = { CV_CPU_BASELINE_FEATURES };
                                     ^
.../build/opencv/cv_cpu_config.h:51:7: note: suggested alternative: ‘CV_CPU_AVX2’
     , CV_CPU_AVX512 \
       ^
...opencv/modules/core/src/system.cpp:531:37: note: in expansion of macro ‘CV_CPU_BASELINE_FEATURES’
         int baseline_features[] = { CV_CPU_BASELINE_FEATURES };
                                     ^

This reproduces on all platforms (including non-AVX512 ones).

The build problem reproducer for AVX512 systems is quite straightforward:

$ CXXFLAGS="-march=native" cmake <opencv_src_dir>
$ make
... error message as above ...

(This can be emulated via the Intel SDE tool: sde -skx -env 'CXXFLAGS' ' -march=native' -- cmake ../../dev)

We need to fix these builds before merging.

@alalek
Member

alalek commented Dec 27, 2017

BTW, OpenCV already knows about these AVX512 CPU capabilities (there is no "AVX512", but there is "AVX_512DQ").

@alalek
Member

alalek commented Dec 27, 2017

In this case, this line should be adapted too (to prevent the compiler from generating unsupported instructions, see #6990).
But then you can't use 512DQ instructions in the code, because the compiler flags are missing.

My suggestion is to rename the current "AVX512" => "AVX_512DQ" and fix the compiler flags. I believe that is enough to support the current patch.
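The compiler-flag side of this can be observed directly: GCC and Clang predefine `__AVX512F__`, `__AVX512DQ__` and similar macros only when the matching `-mavx512*` option (or an `-march` value that implies it) is active, and without that option (or a per-function target attribute) the corresponding intrinsics fail to compile. A minimal sketch that reports which subsets the current build enabled; the function name is hypothetical:

```cpp
#include <cassert>
#include <string>

// Returns the AVX-512 subsets enabled by the *compiler flags* of this build.
// It says nothing about the CPU the program later runs on.
std::string compiled_avx512_subsets() {
    std::string s;
#ifdef __AVX512F__
    s += "F ";
#endif
#ifdef __AVX512CD__
    s += "CD ";
#endif
#ifdef __AVX512BW__
    s += "BW ";
#endif
#ifdef __AVX512DQ__
    s += "DQ ";
#endif
#ifdef __AVX512VL__
    s += "VL ";
#endif
    return s.empty() ? std::string("none") : s;
}
```

Building the same file with `-mavx512f` and then with `-mavx512dq` shows the difference in which macros (and therefore which intrinsics) are available.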

@alalek
Member

alalek commented Dec 28, 2017

Could you take a look at these changes: alalek@pr10416_r?

  • It looks like AVX-512DQ intrinsics are not used yet (so I switched to AVX-512F). If AVX-512DQ is necessary, let me know.
  • I have no access to an AVX512-capable machine at the moment, so I tested this via the SDE tool only (for the "-knl" and "-skx" targets).

@fenrus75
Copy link
Copy Markdown
Contributor Author

Tested your patch on top and it works; updated this PR.

@alalek
Member

alalek commented Dec 28, 2017

Thank you for checking!
I will take a look at the current build failures.

Member

@alalek alalek left a comment


@fenrus75 Thank you for the contribution!

@opencv-pushbot opencv-pushbot merged commit 898ca38 into opencv:master Dec 28, 2017
opencv-pushbot pushed a commit that referenced this pull request Dec 28, 2017
@fenrus75 fenrus75 deleted the avx512 branch December 28, 2017 16:23
@fenrus75
Contributor Author

Looking at the details... it's not quite there:
the performance really comes with avx512vl, not just avx512f.

@alalek
Member

alalek commented Dec 29, 2017

This usually means that the compiler optimizes other code (without direct intrinsic calls) by itself, and the "-mavx512f" option is not enough.

Does a build done this way work well?

cmake -DCPU_BASELINE=NATIVE <opencv_src_dir>
or
CXXFLAGS="-march=native" cmake <opencv_src_dir>
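A quick way to see whether a given build leaves subsets such as AVX512VL unused on the machine it runs on is to compare the compile-time `__AVX512*__` macros with what the CPU reports at runtime. A hedged diagnostic sketch, assuming GCC or Clang on x86-64 (`__builtin_cpu_supports`); the function name is hypothetical and this is not an OpenCV tool:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Whether this build's compiler flags enabled each subset.
#ifdef __AVX512F__
constexpr bool kBuiltF = true;
#else
constexpr bool kBuiltF = false;
#endif
#ifdef __AVX512BW__
constexpr bool kBuiltBW = true;
#else
constexpr bool kBuiltBW = false;
#endif
#ifdef __AVX512DQ__
constexpr bool kBuiltDQ = true;
#else
constexpr bool kBuiltDQ = false;
#endif
#ifdef __AVX512VL__
constexpr bool kBuiltVL = true;
#else
constexpr bool kBuiltVL = false;
#endif

// Lists subsets the CPU supports but this build was not compiled for.
std::vector<std::string> unused_avx512_subsets() {
    std::vector<std::string> unused;
#if defined(__GNUC__) && defined(__x86_64__)
    struct Entry { const char* name; bool on_cpu; bool in_build; };
    const Entry entries[] = {
        {"avx512f",  __builtin_cpu_supports("avx512f")  != 0, kBuiltF},
        {"avx512bw", __builtin_cpu_supports("avx512bw") != 0, kBuiltBW},
        {"avx512dq", __builtin_cpu_supports("avx512dq") != 0, kBuiltDQ},
        {"avx512vl", __builtin_cpu_supports("avx512vl") != 0, kBuiltVL},
    };
    for (const Entry& e : entries)
        if (e.on_cpu && !e.in_build) unused.push_back(e.name);
#endif
    return unused;
}
```

A non-empty result on an AVX512 machine would be consistent with the situation described above, where `-mavx512f` alone does not let the compiler's own (non-intrinsic) vectorization use VL instructions.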

@fenrus75
Contributor Author

fenrus75 commented Dec 29, 2017 via email

@alalek
Member

alalek commented Dec 29, 2017

Thank you for the explanation!

I will take a look at this.

BTW, what CMake options do you use? (Which compiler?)

@alalek alalek mentioned this pull request Dec 29, 2017
@alalek
Member

alalek commented Dec 29, 2017

@fenrus75 Please take a look at #10463 (AVX512_SKX).


3 participants